Recording data access

When working with huge data repositories, it can be worthwhile to record the subset of data used for a given analysis. ONE can keep track of which datasets were loaded via the load_* methods.

Only datasets that were successfully loaded are recorded; missing datasets are ignored.

How to set up and save

At the top of your analysis script, after instantiating ONE, simply set the record_loaded attribute to True:

one.record_loaded = True

At the end of your analysis script, you can save the recorded IDs by calling one.save_loaded_ids(). By default this saves the dataset UUIDs to a CSV file in the root of your cache directory and clears the list of recorded dataset UUIDs; pass clear_list=False to keep the list. Passing sessions_only=True saves the session IDs (eids) instead.

Note.

Within a Python session, calling ONE again with the same arguments (from any location) returns the previous object. To stop recording dataset UUIDs you must therefore explicitly set record_loaded to False, e.g. ONE().record_loaded = False.
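
Because the same object is returned everywhere, it is easy to leave recording switched on by accident. One possible way to guard against this is a small context manager that switches recording on and restores the earlier state afterwards. This is only a sketch: the recording helper is hypothetical and not part of ONE, and a SimpleNamespace stands in for a real ONE instance so that the example runs without a database connection.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def recording(one):
    """Enable `record_loaded` on `one` for the duration of a `with` block.

    Hypothetical helper (not part of ONE); it only assumes that `one`
    has a writable `record_loaded` attribute.
    """
    previous = getattr(one, 'record_loaded', False)
    one.record_loaded = True
    try:
        yield one
    finally:
        # Restore the earlier state even if the analysis code raises
        one.record_loaded = previous


# Stand-in for a ONE instance so the sketch runs without a database connection
fake_one = SimpleNamespace(record_loaded=False)
with recording(fake_one):
    inside = fake_one.record_loaded  # recording is on inside the block
after = fake_one.record_loaded  # earlier state is restored on exit
```

With a real instance the load_* calls would go inside the with block, and one.save_loaded_ids() could be called before the block exits.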

Example

[ ]:
import pandas as pd
from one.api import ONE
one = ONE(base_url='https://openalyx.internationalbrainlab.org')

# Turn on recording of loaded dataset UUIDs
one.record_loaded = True

# Load some trials data
eid = 'KS023/2019-12-10/001'
dsets = one.load_object(eid, 'trials')

# Load another dataset
eid = 'CSHL049/2020-01-08/001'
dset = one.load_dataset(eid, 'probes.description')

# Save the dataset IDs to file
dataset_uuids, filename = one.save_loaded_ids(clear_list=False)
print(filename)
print(pd.read_csv(filename), end='\n\n')

# Save the session IDs
session_uuids, filename = one.save_loaded_ids(sessions_only=True)
print(filename)
print(pd.read_csv(filename))

F:\FlatIron\openalyx.internationalbrainlab.org\2022-02-24T13-37-07_loaded_dataset_uuids.csv
                            dataset_uuid
0   0bc9607d-0a72-4c5c-8b9d-e239a575ff67
1   16c81eaf-a032-49cd-9823-09c0c7350fd2
2   2f4cc220-55b9-4fb3-9692-9aaa5362288f
3   4ee1110f-3ff3-4e26-87b0-41b687f75ce3
4   63aa7dea-1ee2-4a0c-88bc-00b5cba6b8b0
5   69236a5d-1e4a-4bea-85e9-704492756848
6   6b94f568-9bb6-417c-9423-a84559f403d5
7   82237144-41bb-4e7f-9ef4-cabda4381d9f
8   91f08c6d-7ee0-487e-adf5-9c751769af06
9   b77d2665-876e-41e7-ac57-aa2854c5d5cd
10  c14d8683-3706-4e44-a8d2-cd0e2bfd4579
11  c8cd43a7-b443-4342-8c37-aa93a2067447
12  d078bfc8-214d-4682-8621-390ad74dd6d5
13  d11d7b33-3a96-4ea6-849f-5448a97d3fc1
14  d73f567a-5799-4051-9bc8-6f0fd6bb478b
15  e1793e9d-cd96-4cb6-9fd7-a6b662c41971
16  fceb8cfe-77b4-4177-a6af-44fbf51b33d0

F:\FlatIron\openalyx.internationalbrainlab.org\2022-02-24T13-37-07_loaded_session_uuids.csv
                           session_uuid
0  4b7fbad4-f6de-43b4-9b15-c7c7ef44db4b
1  aad23144-0e52-4eac-80c5-c4ee2decb198
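
The saved files can be read back later, for example to check which datasets an analysis depended on. The sketch below uses only the standard library and assumes the layout shown in the output above: a single header row (dataset_uuid or session_uuid) followed by one ID per line. The read_saved_ids helper and the demo file name are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def read_saved_ids(filename, column='dataset_uuid'):
    """Return the IDs listed in one column of a saved CSV file.

    Hypothetical helper; assumes the file has a header row naming `column`.
    """
    with open(filename, newline='') as fp:
        return [row[column] for row in csv.DictReader(fp)]


# Demonstration with a small hand-written file in place of a real output file
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / 'demo_loaded_dataset_uuids.csv'
    demo_file.write_text(
        'dataset_uuid\n'
        '0bc9607d-0a72-4c5c-8b9d-e239a575ff67\n'
        '16c81eaf-a032-49cd-9823-09c0c7350fd2\n'
    )
    ids = read_saved_ids(demo_file)
```

For a session ID file, pass column='session_uuid' instead.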

Data reproducibility

The Alyx database may be periodically updated with revised datasets. Data revisions occur when a session is preprocessed with a better algorithm. Typically the newest revision is considered the default one to load. To ensure the results of an analysis don’t change unexpectedly, either of the two methods below can be used to ‘freeze’ the data.

Saving the data access tables

When search and load queries are made, the results are stored in memory. These access tables can be saved to disk then used by ONE in local mode. First, run your analysis code with the various search and load queries to build up the access tables in memory, then save them using the save_cache method.

Note.

Unlike the record_loaded option above, these access tables contain the results of all session and dataset queries, not just a list of loaded datasets.

Example 1: Saving the access tables

[ ]:
from one.api import ONE
one = ONE()

... # Run one.search and one.load_* methods here

# Save the tables to disk
one.save_cache()

# To use these tables in a new session, initialize ONE in local mode
one = ONE(mode='local')

When the access tables are saved in the default location, they are loaded by default each time ONE is instantiated. However, they won’t be used unless either mode is set to ‘local’ or a method is called with query_type='local'. The access tables can be reset at any time by calling one.reset_cache. When saving, any tables that already exist on disk are merged and updated. To fully overwrite the tables on disk, use clobber=True in the save_cache method.

Example 2: Save tables to different location

[ ]:
from one.api import ONE
one = ONE()

... # Run one.search and one.load_* methods here

# Save the tables to disk, overwriting any existing tables
one.save_cache(one.cache_dir / 'my_analysis_tables', clobber=True)

# To use these tables in a new session, initialize ONE in local mode
one = ONE(mode='local')  # tables_dir may also be passed as a kwarg here
# The tables must be manually loaded as they are not in the default location
one.load_cache(one.cache_dir / 'my_analysis_tables')

The saved access tables are .pqt files that can also be manually shared with other users.
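
As one possible way to share them, the .pqt files in a tables directory could be collected into a single archive. The sketch below uses only the standard library; the bundle_tables helper is hypothetical, and the file names in the demonstration are empty placeholders rather than real tables.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_tables(tables_dir, archive_path):
    """Zip every .pqt file found directly inside `tables_dir`.

    Hypothetical helper (not part of ONE). Returns the archived file names.
    """
    tables = sorted(Path(tables_dir).glob('*.pqt'))
    with zipfile.ZipFile(archive_path, 'w') as archive:
        for table in tables:
            archive.write(table, arcname=table.name)
    return [table.name for table in tables]


# Demonstration with empty placeholder files; the file names are invented
with tempfile.TemporaryDirectory() as tmp:
    tables_dir = Path(tmp) / 'my_analysis_tables'
    tables_dir.mkdir()
    for name in ('sessions.pqt', 'datasets.pqt', 'notes.txt'):
        (tables_dir / name).touch()
    bundled = bundle_tables(tables_dir, Path(tmp) / 'tables.zip')
```

The recipient would unpack the archive and load the tables with one.load_cache, as in Example 2.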

Note.

If a dataset is deleted from the database it will only be loadable if the file still exists locally.

Setting a revision date

Another way to reproduce data access on a certain date is to use the ONE_REVISION_LAST_BEFORE environment variable. With this variable set to an ISO-formatted date, any datasets with a revision newer than this date are ignored. Similarly, a user can pass this date into the load methods using the revision kwarg.

Example:

[1]:
import os
os.environ['ONE_REVISION_LAST_BEFORE'] = '2024-04-01'

from one.api import ONE
one = ONE()

eid = 'b52182e7-39f6-4914-9717-136db589706e'
dsets = one.list_datasets(eid, filename='drift.times.npy', collection='alf/probe00/pykilosort')
print(dsets)  # shows at least two versions of this dataset

file = one.load_dataset(eid, 'drift.times.npy', collection='alf/probe00/pykilosort', download_only=True)
print(file.relative_to_session())  # shows the older version of this dataset

['alf/probe00/pykilosort/#2024-05-06#/drift.times.npy', 'alf/probe00/pykilosort/drift.times.npy']
alf\probe00\pykilosort\drift.times.npy
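
The selection seen in the output above can be illustrated with a few lines of plain Python. This is a simplified sketch of the "last before" idea, not ONE's own implementation: it assumes revisions appear as a #...# folder directly above the file name, that a dataset without such a folder counts as the oldest version, and that a revision equal to the cutoff date is still accepted. The names revision_of and last_before are made up for this example.

```python
import re

# Revision folders are written between hashes, e.g. '#2024-05-06#'
REVISION_PATTERN = re.compile(r'#([^#/]+)#/[^/]+$')


def revision_of(rel_path):
    """Return the revision string in a relative path, or '' if there is none."""
    match = REVISION_PATTERN.search(rel_path)
    return match.group(1) if match else ''


def last_before(rel_paths, cutoff):
    """Pick the path with the newest revision that is not later than `cutoff`.

    ISO dates sort chronologically as plain strings, and the empty string
    (no revision) sorts before every date, so a string comparison is enough.
    Treating the cutoff as inclusive is an assumption of this sketch.
    """
    eligible = [p for p in rel_paths if revision_of(p) <= cutoff]
    return max(eligible, key=revision_of) if eligible else None


paths = [
    'alf/probe00/pykilosort/#2024-05-06#/drift.times.npy',
    'alf/probe00/pykilosort/drift.times.npy',
]
chosen = last_before(paths, '2024-04-01')  # the unrevised dataset
```

With a later cutoff such as '2024-06-01', the same function would select the #2024-05-06# revision instead.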

Warning.

If a dataset’s QC is set to CRITICAL it will no longer be loadable, even with ONE_REVISION_LAST_BEFORE set. Likewise, if a dataset is removed from the database it will no longer be loadable. In these cases, saving the data access tables is more robust.