ibllib.oneibl.data_handlers
Downloading of task dependent datasets and registration of task output datasets.
The DataHandler class is used by the pipes.tasks.Task class to ensure dependent datasets are
present and to register and upload the output datasets. For examples on how to run a task using
specific data handlers, see ibllib.pipes.tasks().
Functions
dataset_from_name – From a list of ExpectedDataset instances, return those that match a given name.
update_collections – Update the collection of a dataset.
Classes
ExpectedDataset – An expected input or output dataset.
Input – An expected input dataset.
OptionalDataset – An expected dataset that is not strictly required.
OptionalInput – An optional expected input dataset.
OptionalOutput – An optional expected output dataset.
Output – An expected output dataset.
RemoteGlobusDataHandler – Data handler for running tasks on a remote compute node.
SDSCDataHandler – Data handler for running tasks on an SDSC compute node.
- class ExpectedDataset(name, collection, register=None, revision=None, unique=True)[source]
Bases:
object
An expected input or output dataset.
- inverted = False
- property register
whether to register the output file.
- Type:
bool
- property identifiers
the identifying parts of the dataset.
If no operator is applied, the identifiers are (collection, revision, name). If an operator is applied, a tuple of 3-element tuples is returned.
- Type:
tuple
- property glob_pattern
one or more glob patterns.
- Type:
str, tuple of str
- find_files(session_path, register=False)[source]
Find files on disk.
Uses glob patterns to find dataset(s) on disk.
- Parameters:
session_path (pathlib.Path, str) – A session path within which to glob for the dataset(s).
register (bool) – Only return files intended to be registered.
- Returns:
bool – True if the dataset is found on disk or is optional.
list of pathlib.Path – A list of matching dataset files.
None, str, set of str – One or more glob patterns that yielded no files (or, for inverted datasets, patterns that did yield files).
Notes
Currently, if unique is true and multiple files are found, all files are returned and no exception is raised, although this may change in the future.
If register is false, all files are returned regardless of whether they are intended to be registered.
If inverted is true, and files are found, the glob pattern is returned as missing.
If XOR, returns all patterns if all are present when only one should be, otherwise returns all missing patterns.
Missing (or unexpectedly found) patterns are returned despite the dataset being optional.
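The return convention described above can be illustrated with a small standalone sketch. This is not ibllib's implementation: find_files_sketch and its optional and inverted arguments are hypothetical names, only a single glob pattern is handled, and the register and XOR cases are left out.

```python
from pathlib import Path
import tempfile


def find_files_sketch(session_path, pattern, inverted=False, optional=False):
    """Hypothetical stand-in: glob one pattern, return (ok, files, missing)."""
    files = sorted(Path(session_path).glob(pattern))
    if inverted:
        # For an inverted dataset, finding files is the failure case.
        missing = pattern if files else None
    else:
        missing = None if files else pattern
    # An optional dataset is always "ok", but the missing pattern is still reported.
    ok = optional or missing is None
    return ok, files, missing


with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp)
    (session / 'alf').mkdir()
    (session / 'alf' / '_ibl_trials.table.pqt').touch()

    present = find_files_sketch(session, 'alf/_ibl_trials.table.*')
    absent = find_files_sketch(session, 'alf/_ibl_wheel.position.npy')
    absent_optional = find_files_sketch(
        session, 'alf/_ibl_wheel.position.npy', optional=True)
    unwanted = find_files_sketch(session, 'alf/_ibl_trials.table.*', inverted=True)
```

In this sketch, absent_optional reports success while still carrying the pattern that matched nothing, which mirrors the last note above.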
- filter(session_datasets, **kwargs)[source]
Filter dataset frame by expected datasets.
- Parameters:
session_datasets (pandas.DataFrame) – A data frame of session datasets.
kwargs – Extra arguments for one.util.filter_datasets, namely revision_last_before, qc, and ignore_qc_not_set.
- Returns:
bool – True if the required dataset(s) are present in the data frame.
pandas.DataFrame – A filtered data frame containing the expected dataset(s).
- static input(name, collection, required=True, register=False, **kwargs)[source]
Create an expected input dataset.
By default, expected input datasets are not automatically registered.
- Parameters:
name (str) – A dataset name or glob pattern.
collection (str, None) – An ALF collection or pattern.
required (bool) – Whether file must always be present, or is an optional dataset. Default is True.
register (bool) – Whether to register the input file. Default is False for input files, True for output files.
revision (str) – An optional revision.
unique (bool) – Whether identifier pattern is expected to match a single dataset or several.
- Returns:
An instance of an Input dataset if required is true, otherwise an OptionalInput.
- Return type:
Input, OptionalInput
- static output(name, collection, required=True, register=True, **kwargs)[source]
Create an expected output dataset.
By default, expected output datasets are automatically registered.
- Parameters:
name (str) – A dataset name or glob pattern.
collection (str, None) – An ALF collection or pattern.
required (bool) – Whether file must always be present, or is an optional dataset. Default is True.
register (bool) – Whether to register the output file. Default is False for input files, True for output files.
revision (str) – An optional revision.
unique (bool) – Whether identifier pattern is expected to match a single dataset or several.
- Returns:
An instance of an Output dataset if required is true, otherwise an OptionalOutput.
- Return type:
Output, OptionalOutput
- class OptionalDataset(name, collection, register=None, revision=None, unique=True)[source]
Bases:
ExpectedDataset
An expected dataset that is not strictly required.
- find_files(session_path, register=False)[source]
Find files on disk.
Uses glob patterns to find dataset(s) on disk.
- Parameters:
session_path (pathlib.Path, str) – A session path within which to glob for the dataset(s).
register (bool) – Only return files intended to be registered.
- Returns:
True – Always True as dataset is optional.
list of pathlib.Path – A list of matching dataset files.
None, str, set of str – One or more glob patterns that yielded no files (or, for inverted datasets, patterns that did yield files).
Notes
Currently, if unique is true and multiple files are found, all files are returned and no exception is raised, although this may change in the future.
If register is false, all files are returned regardless of whether they are intended to be registered.
If inverted is true, and files are found, the glob pattern is returned as missing.
If XOR, returns all patterns if all are present when only one should be, otherwise returns all missing patterns.
Missing (or unexpectedly found) patterns are returned despite the dataset being optional.
- filter(session_datasets, **kwargs)[source]
Filter dataset frame by expected datasets.
- Parameters:
session_datasets (pandas.DataFrame) – A data frame of session datasets.
kwargs – Extra arguments for one.util.filter_datasets, namely revision_last_before, qc, ignore_qc_not_set, and assert_unique.
- Returns:
True – Always True as dataset is optional.
pandas.DataFrame – A filtered data frame containing the expected dataset(s).
- class Input(name, collection, register=None, revision=None, unique=True)[source]
Bases:
ExpectedDataset
An expected input dataset.
- class OptionalInput(name, collection, register=None, revision=None, unique=True)[source]
Bases:
Input, OptionalDataset
An optional expected input dataset.
- class Output(name, collection, register=None, revision=None, unique=True)[source]
Bases:
ExpectedDataset
An expected output dataset.
- class OptionalOutput(name, collection, register=None, revision=None, unique=True)[source]
Bases:
Output, OptionalDataset
An optional expected output dataset.
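To picture how the input() and output() factory methods relate to the class hierarchy above, here is a minimal stand-in. The class names and signatures are copied from this page, but the bodies are assumptions written for illustration only (for example, register is a plain attribute here rather than a property), so this should not be read as ibllib's code.

```python
class ExpectedDataset:
    """Stand-in for the documented base class (illustrative only)."""

    def __init__(self, name, collection, register=None, revision=None, unique=True):
        self.name = name
        self.collection = collection
        self.register = register
        self.revision = revision
        self.unique = unique

    @staticmethod
    def input(name, collection, required=True, register=False, **kwargs):
        # Required inputs map to Input, optional ones to OptionalInput.
        cls = Input if required else OptionalInput
        return cls(name, collection, register=register, **kwargs)

    @staticmethod
    def output(name, collection, required=True, register=True, **kwargs):
        # Outputs are registered by default, unlike inputs.
        cls = Output if required else OptionalOutput
        return cls(name, collection, register=register, **kwargs)


class OptionalDataset(ExpectedDataset):
    pass


class Input(ExpectedDataset):
    pass


class OptionalInput(Input, OptionalDataset):
    pass


class Output(ExpectedDataset):
    pass


class OptionalOutput(Output, OptionalDataset):
    pass


# Example dataset names and collections below are illustrative.
raw = ExpectedDataset.input('_iblrig_taskData.raw.*', 'raw_behavior_data')
cam = ExpectedDataset.input('_iblrig_leftCamera.raw.mp4', 'raw_video_data',
                            required=False)
trials = ExpectedDataset.output('_ibl_trials.table.pqt', 'alf')
```

Because the optional variants inherit from both a direction class and OptionalDataset, isinstance checks against either base succeed for them.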
- dataset_from_name(name, datasets)[source]
From a list of ExpectedDataset instances, return those that match a given name.
- Parameters:
name (str) – The name of the dataset.
datasets (list of ExpectedDataset) – A list of ExpectedDataset instances.
- Returns:
The ExpectedDataset instances that match the given name.
- Return type:
list of ExpectedDataset
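A rough sketch of the lookup, assuming an exact match on a name attribute. NamedStub and dataset_from_name_sketch are hypothetical stand-ins; the real function may also handle nested or combined datasets, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass
class NamedStub:
    """Hypothetical stand-in for an ExpectedDataset with a name and collection."""
    name: str
    collection: str


def dataset_from_name_sketch(name, datasets):
    """Return every stub whose name equals the requested name."""
    return [d for d in datasets if d.name == name]


stubs = [
    NamedStub('_ibl_trials.table.pqt', 'alf'),
    NamedStub('_ibl_wheel.position.npy', 'alf'),
    NamedStub('_ibl_trials.table.pqt', 'alf/task_00'),
]
matches = dataset_from_name_sketch('_ibl_trials.table.pqt', stubs)
```

A list is returned rather than a single item because the same dataset name can appear in more than one collection.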
- update_collections(dataset, new_collection, substring=None, unique=None)[source]
Update the collection of a dataset.
This updates all nested ExpectedDataset instances with the new collection and returns copies.
- Parameters:
dataset (ExpectedDataset) – The dataset to update.
new_collection (str, list of str) – The new collection or collections.
substring (str, optional) – An optional substring in the collection to replace with new collection(s). If None, the entire collection will be replaced.
- Returns:
A copy of the dataset with the updated collection(s).
- Return type:
ExpectedDataset
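The substring behaviour can be sketched as follows, assuming a single string for new_collection. CollectionStub and update_collection_sketch are hypothetical; the handling of a list of collections and of nested datasets in the real function is not reproduced here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CollectionStub:
    """Hypothetical stand-in holding only a name and a collection."""
    name: str
    collection: str


def update_collection_sketch(dataset, new_collection, substring=None):
    """Return a copy with the collection replaced, wholly or by substring."""
    if substring is None:
        collection = new_collection
    else:
        collection = dataset.collection.replace(substring, new_collection)
    # dataclasses.replace builds a new object, leaving the original untouched.
    return replace(dataset, collection=collection)


original = CollectionStub('spikes.times.npy', 'alf/probe00')
whole = update_collection_sketch(original, 'alf/probe01')
partial = update_collection_sketch(original, 'probe01', substring='probe00')
```

Returning a copy keeps the original signature intact, which matches the statement above that copies are returned.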
- class DataHandler(session_path, signature, one=None)[source]
Bases:
ABC
- getData(one=None)[source]
Finds the datasets required for the task based on its input signatures.
- Parameters:
one (one.api.One, optional) – An instance of ONE to use.
- Returns:
A data frame of required datasets. An empty frame is returned if no registered datasets are required, while None is returned if no instance of ONE is set.
- Return type:
pandas.DataFrame, None
- getOutputFiles()[source]
Return a data frame of output datasets found on disk.
- Returns:
A data frame of the datasets on disk that were specified in signature['output_files'].
- Return type:
pandas.DataFrame
- class LocalDataHandler(session_path, signatures, one=None)[source]
Bases:
DataHandler
- class ServerDataHandler(session_path, signatures, one=None)[source]
Bases:
DataHandler
- uploadData(outputs, version, clobber=False, **kwargs)[source]
Upload and/or register output data.
This is typically called by ibllib.pipes.tasks.Task.register_datasets().
- Parameters:
outputs (list of pathlib.Path) – A list of ALF paths to register to Alyx.
version (str, list of str) – The version of ibllib used to generate these output files.
clobber (bool) – If True, re-upload outputs that have already been passed to this method.
kwargs – Optional keyword arguments for one.registration.RegistrationClient.register_files.
- Returns:
A list of newly created Alyx dataset records or the registration data if dry.
- Return type:
list of dicts, dict
- class ServerGlobusDataHandler(session_path, signatures, one=None)[source]
Bases:
DataHandler
- class RemoteEC2DataHandler(session_path, signature, one=None)[source]
Bases:
DataHandler
- class RemoteHttpDataHandler(session_path, signature, one=None)[source]
Bases:
DataHandler
- class RemoteAwsDataHandler(session_path, signature, one=None)[source]
Bases:
DataHandler
- class RemoteGlobusDataHandler(session_path, signature, one=None)[source]
Bases:
DataHandler
Data handler for running tasks on a remote compute node. Missing data are downloaded using Globus.
- Parameters:
session_path – path to session
signature – input and output file signatures
one – ONE instance
- class SDSCDataHandler(session_path, signatures, one=None)[source]
Bases:
DataHandler
Data handler for running tasks on an SDSC compute node.
- Parameters:
session_path – path to session
signatures – input and output file signatures
one – ONE instance
- class PopeyeDataHandler(session_path, signatures, one=None)[source]
Bases:
SDSCDataHandler