one.alf.cache

Construct Parquet database from local file system.

NB: If using a remote Alyx instance, it is advisable to generate the cache via the Alyx one_cache management command; otherwise the resulting cache UUIDs will not match those on the database.

Examples

>>> from one.api import One
>>> from one.alf.cache import make_parquet_db
>>> cache_dir = 'path/to/data'
>>> make_parquet_db(cache_dir)
>>> one = One(cache_dir=cache_dir)

Module attributes

QC_TYPE

The cache table QC column data type.

SESSIONS_COLUMNS

A map of sessions table fields and their data types.

DATASETS_COLUMNS

A map of datasets table fields and their data types.

EMPTY_DATASETS_FRAME

An empty datasets dataframe with correct columns and dtypes.

EMPTY_SESSIONS_FRAME

An empty sessions dataframe with correct columns and dtypes.

Functions

cast_index_object

Cast the index object to the specified dtype.

load_tables

Load parquet cache files from a local directory.

make_parquet_db

Given a data directory, index the ALF datasets and save the generated cache tables.

merge_tables

Update the cache tables with new records.

patch_tables

Reformat older cache tables to comply with this version of ONE.

remove_missing_datasets

Remove dataset files and session folders that are not in the provided cache.

remove_table_files

Delete cache tables on disk.

make_parquet_db(root_dir, out_dir=None, hash_ids=True, hash_files=False, lab=None)[source]

Given a data directory, index the ALF datasets and save the generated cache tables.

Parameters:
  • root_dir (str, pathlib.Path) – The file directory to index.

  • out_dir (str, pathlib.Path) – Optional output directory to save cache tables. If None, the files are saved into the root directory.

  • hash_ids (bool) – If True, experiment and dataset IDs will be UUIDs generated from the system and relative paths (required for use with ONE API).

  • hash_files (bool) – If True, an MD5 hash is computed for each dataset and stored in the datasets table. This will substantially increase cache generation time.

  • lab (str) – An optional lab name to associate with the data. If the folder structure contains ‘lab/Subjects’, the lab name will be taken from the folder name.

Returns:

  • pathlib.Path – The full path of the saved sessions parquet table.

  • pathlib.Path – The full path of the saved datasets parquet table.
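
Example

A minimal sketch, reusing the hypothetical 'path/to/data' directory from the module example above:

>>> from one.alf.cache import make_parquet_db
>>> sessions_file, datasets_file = make_parquet_db('path/to/data', hash_files=False)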

patch_tables(table: DataFrame, min_api_version=None, name=None) → DataFrame[source]

Reformat older cache tables to comply with this version of ONE.

Currently this function will:

  1. convert integer UUIDs to string UUIDs;

  2. rename the ‘project’ column to ‘projects’;

  3. add the QC column;

  4. drop the session_path column.

Parameters:
  • table (pd.DataFrame) – A cache table (from One._cache).

  • min_api_version (str) – The minimum API version supported by this cache table.

  • name ({'dataset', 'session'} str) – The name of the table.
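
Example

A sketch of patching a table loaded directly from disk; the file path and version string here are illustrative:

>>> import pandas as pd
>>> from one.alf.cache import patch_tables
>>> datasets = pd.read_parquet('path/to/data/datasets.pqt')
>>> datasets = patch_tables(datasets, min_api_version='1.0.0', name='dataset')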

merge_tables(cache, strict=False, **kwargs)[source]

Update the cache tables with new records.

Parameters:
  • cache (dict) – A map of cache tables to update.

  • strict (bool) – If False, the input columns need not match those of the cache table: extra columns in input tables are dropped and missing columns are added and filled with np.nan.

  • kwargs – pandas.DataFrame or pandas.Series to insert/update for each table.

Returns:

A timestamp of when the cache was updated.

Return type:

datetime.datetime

Example

>>> from one.converters import ses2records
>>> session, datasets = ses2records(one.get_details(eid, full=True))
>>> merge_tables(one._cache, sessions=session, datasets=datasets)

Raises:
  • AssertionError – When strict is True the input columns must exactly match those of the cache table, including the order.

  • KeyError – One or more of the keyword arguments does not match a table in cache.
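
As a further sketch, tables loaded from disk may be updated in place; here 'new_datasets' is assumed to be a datasets DataFrame sharing the cache table's index:

>>> tables = load_tables('path/to/data')
>>> updated = merge_tables(tables, strict=False, datasets=new_datasets)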

QC_TYPE = CategoricalDtype(categories=['NOT_SET', 'PASS', 'WARNING', 'FAIL', 'CRITICAL'], ordered=True, categories_dtype=object)

The cache table QC column data type.

Type:

pandas.api.types.CategoricalDtype
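
Because the categories are ordered, QC values can be compared meaningfully once cast to this dtype:

>>> import pandas as pd
>>> from one.alf.cache import QC_TYPE
>>> qc = pd.Series(['PASS', 'FAIL', 'NOT_SET']).astype(QC_TYPE)
>>> (qc < 'WARNING').tolist()
[True, False, True]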

remove_table_files(folder, tables=('sessions', 'datasets'))[source]

Delete cache tables on disk.

Parameters:
  • folder (pathlib.Path) – The directory path containing cache tables to remove.

  • tables (list of str) – A list of table names to remove, e.g. [‘sessions’, ‘datasets’]. NB: This will also delete the cache_info.json metadata file.

Returns:

A list of the removed files.

Return type:

list of pathlib.Path
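
Example

A sketch with a hypothetical cache directory:

>>> from pathlib import Path
>>> from one.alf.cache import remove_table_files
>>> removed = remove_table_files(Path('path/to/data'), tables=('sessions', 'datasets'))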

remove_missing_datasets(cache_dir, tables=None, remove_empty_sessions=True, dry=True)[source]

Remove dataset files and session folders that are not in the provided cache.

NB: This does not remove cache table entries for datasets that are missing on disk. Non-ALF files are not removed, and empty session folders that are present in the sessions table are not removed.

Parameters:
  • cache_dir (str, pathlib.Path) – The root data directory from which to remove datasets.

  • tables (dict[str, pandas.DataFrame], optional) – A dict with keys (‘sessions’, ‘datasets’), containing the cache tables as DataFrames.

  • remove_empty_sessions (bool) – Attempt to remove session folders that are empty and not in the sessions table.

  • dry (bool) – If True, do not remove anything; simply return the paths that would be removed.

Returns:

A sorted list of paths to be removed.

Return type:

list
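
Example

A sketch of a dry run, followed by the actual removal (directory hypothetical):

>>> to_remove = remove_missing_datasets('path/to/data', dry=True)  # list paths only
>>> removed = remove_missing_datasets('path/to/data', dry=False)  # delete files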

load_tables(tables_dir, glob_pattern='*.pqt')[source]

Load parquet cache files from a local directory.

Parameters:
  • tables_dir (str, pathlib.Path) – The directory location of the parquet files.

  • glob_pattern (str) – A glob pattern to match the cache files.

Returns:

A Bunch object containing the loaded cache tables and associated metadata.

Return type:

Bunch
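
Example

Assuming the default '*.pqt' tables are present in the directory:

>>> from one.alf.cache import load_tables
>>> tables = load_tables('path/to/data')
>>> tables['datasets'].head()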

EMPTY_DATASETS_FRAME = Empty DataFrame Columns: [rel_path, file_size, hash, exists, qc] Index: []

An empty datasets dataframe with correct columns and dtypes.

Type:

pandas.DataFrame

EMPTY_SESSIONS_FRAME = Empty DataFrame Columns: [lab, subject, date, number, task_protocol, projects] Index: []

An empty sessions dataframe with correct columns and dtypes.

Type:

pandas.DataFrame
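
These empty frames serve as typed templates, e.g. when initializing a cache from scratch:

>>> from one.alf.cache import EMPTY_DATASETS_FRAME, EMPTY_SESSIONS_FRAME
>>> datasets = EMPTY_DATASETS_FRAME.copy()  # empty datasets table, correct dtypes
>>> sessions = EMPTY_SESSIONS_FRAME.copy()  # empty sessions table, correct dtypes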