ibllib.oneibl.data_handlers

Downloading of task-dependent datasets and registration of task output datasets.

The DataHandler class is used by the pipes.tasks.Task class to ensure dependent datasets are present and to register and upload the output datasets. For examples of how to run a task using specific data handlers, see ibllib.pipes.tasks.

Functions

dataset_from_name

From a list of ExpectedDataset instances, return those that match a given name.

update_collections

Update the collection of a dataset.

Classes

DataHandler

ExpectedDataset

An expected input or output dataset.

Input

An expected input dataset.

LocalDataHandler

OptionalDataset

An expected dataset that is not strictly required.

OptionalInput

An optional expected input dataset.

OptionalOutput

An optional expected output dataset.

Output

An expected output dataset.

PopeyeDataHandler

RemoteAwsDataHandler

RemoteEC2DataHandler

RemoteGlobusDataHandler

Data handler for running tasks on a remote compute node.

RemoteHttpDataHandler

SDSCDataHandler

Data handler for running tasks on an SDSC compute node.

ServerDataHandler

ServerGlobusDataHandler

class ExpectedDataset(name, collection, register=None, revision=None, unique=True)[source]

Bases: object

An expected input or output dataset.

inverted = False
property register

Whether to register the output file.

Type:

bool

property identifiers

The identifying parts of the dataset.

If no operator is applied, the identifiers are (collection, revision, name). If an operator is applied, a tuple of 3-element tuples is returned.

Type:

tuple

property glob_pattern

One or more glob patterns.

Type:

str, tuple of str

find_files(session_path, register=False)[source]

Find files on disk.

Uses glob patterns to find dataset(s) on disk.

Parameters:
  • session_path (pathlib.Path, str) – A session path within which to glob for the dataset(s).

  • register (bool) – Only return files intended to be registered.

Returns:

  • bool – True if the dataset is found on disk or is optional.

  • list of pathlib.Path – A list of matching dataset files.

  • missing (None, str, set of str) – One or more glob patterns that either did not yield files or, in the case of inverted datasets, unexpectedly did.

Notes

  • Currently, if unique is true and multiple files are found, all files are returned without raising an exception, although this may change in the future.

  • If register is false, all files are returned regardless of whether they are intended to be registered.

  • If inverted is true, and files are found, the glob pattern is returned as missing.

  • For XOR datasets, all patterns are returned if all are present when only one should be; otherwise all missing patterns are returned.

  • Missing (or unexpectedly found) patterns are returned despite the dataset being optional.
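The return convention described above can be illustrated with a minimal standard-library sketch. It is not the ibllib implementation: find_files_sketch and its arguments are hypothetical, and the register, unique and XOR handling are left out.

```python
import tempfile
from pathlib import Path


def find_files_sketch(session_path, collection, name, inverted=False):
    """Hypothetical helper mimicking the (ok, files, missing) return shape.

    This is an illustration only and not how ibllib implements find_files.
    """
    pattern = f'{collection}/{name}' if collection else name
    files = sorted(Path(session_path).glob(pattern))
    found = len(files) > 0
    # An inverted dataset is satisfied only when nothing matches the pattern
    ok = found != inverted
    # The offending pattern is reported whenever the expectation is not met
    missing = None if ok else pattern
    return ok, files, missing


with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp)
    (session / 'alf').mkdir()
    (session / 'alf' / '_ibl_trials.table.pqt').touch()

    present = find_files_sketch(session, 'alf', '_ibl_trials.table.*')
    absent = find_files_sketch(session, 'alf', '_ibl_wheel.position.npy')
    unexpected = find_files_sketch(session, 'alf', '_ibl_trials.table.*', inverted=True)
```

Here present reports success with one file and no missing pattern, absent reports failure with its pattern as missing, and unexpected reports failure because an inverted dataset was found on disk.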

filter(session_datasets, **kwargs)[source]

Filter dataset frame by expected datasets.

Parameters:
  • session_datasets (pandas.DataFrame) – A data frame of session datasets.

  • kwargs – Extra arguments for one.util.filter_datasets, namely revision_last_before, qc, and ignore_qc_not_set.

Returns:

  • bool – True if the required dataset(s) are present in the data frame.

  • pandas.DataFrame – A filtered data frame containing the expected dataset(s).

static input(name, collection, required=True, register=False, **kwargs)[source]

Create an expected input dataset.

By default, expected input datasets are not automatically registered.

Parameters:
  • name (str) – A dataset name or glob pattern.

  • collection (str, None) – An ALF collection or pattern.

  • required (bool) – Whether the file must always be present, or is an optional dataset. Default is True.

  • register (bool) – Whether to register the input file. Default is False for input files, True for output files.

  • revision (str) – An optional revision.

  • unique (bool) – Whether identifier pattern is expected to match a single dataset or several.

Returns:

An instance of an Input dataset if required is true, otherwise an OptionalInput.

Return type:

Input, OptionalInput

static output(name, collection, required=True, register=True, **kwargs)[source]

Create an expected output dataset.

By default, expected output datasets are automatically registered.

Parameters:
  • name (str) – A dataset name or glob pattern.

  • collection (str, None) – An ALF collection or pattern.

  • required (bool) – Whether the file must always be present, or is an optional dataset. Default is True.

  • register (bool) – Whether to register the output file. Default is False for input files, True for output files.

  • revision (str) – An optional revision.

  • unique (bool) – Whether identifier pattern is expected to match a single dataset or several.

Returns:

An instance of an Output dataset if required is true, otherwise an OptionalOutput.

Return type:

Output, OptionalOutput
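The two factory methods differ only in their default for register and in the class they return. A minimal sketch of that dispatch follows, using toy classes whose names (all ending in Sketch) are hypothetical stand-ins and not the ibllib classes themselves.

```python
class ExpectedDatasetSketch:
    """Toy stand-in for ExpectedDataset; every name here is hypothetical."""

    def __init__(self, name, collection, register=None, revision=None, unique=True):
        self.name = name
        self.collection = collection
        self.register = register
        self.revision = revision
        self.unique = unique

    @staticmethod
    def input(name, collection, required=True, register=False, **kwargs):
        # Inputs are not registered unless explicitly requested
        cls = InputSketch if required else OptionalInputSketch
        return cls(name, collection, register=register, **kwargs)

    @staticmethod
    def output(name, collection, required=True, register=True, **kwargs):
        # Outputs are registered unless explicitly disabled
        cls = OutputSketch if required else OptionalOutputSketch
        return cls(name, collection, register=register, **kwargs)


class OptionalDatasetSketch(ExpectedDatasetSketch):
    pass


class InputSketch(ExpectedDatasetSketch):
    pass


class OptionalInputSketch(InputSketch, OptionalDatasetSketch):
    pass


class OutputSketch(ExpectedDatasetSketch):
    pass


class OptionalOutputSketch(OutputSketch, OptionalDatasetSketch):
    pass


raw = ExpectedDatasetSketch.input('_iblrig_taskData.raw.*', 'raw_behavior_data')
trials = ExpectedDatasetSketch.output('_ibl_trials.table.pqt', 'alf', required=False)
```

The dataset names and collections in the usage lines are illustrative values only.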

class OptionalDataset(name, collection, register=None, revision=None, unique=True)[source]

Bases: ExpectedDataset

An expected dataset that is not strictly required.

find_files(session_path, register=False)[source]

Find files on disk.

Uses glob patterns to find dataset(s) on disk.

Parameters:
  • session_path (pathlib.Path, str) – A session path within which to glob for the dataset(s).

  • register (bool) – Only return files intended to be registered.

Returns:

  • True – Always True as dataset is optional.

  • list of pathlib.Path – A list of matching dataset files.

  • missing (None, str, set of str) – One or more glob patterns that either did not yield files or, in the case of inverted datasets, unexpectedly did.

Notes

  • Currently, if unique is true and multiple files are found, all files are returned without raising an exception, although this may change in the future.

  • If register is false, all files are returned regardless of whether they are intended to be registered.

  • If inverted is true, and files are found, the glob pattern is returned as missing.

  • For XOR datasets, all patterns are returned if all are present when only one should be; otherwise all missing patterns are returned.

  • Missing (or unexpectedly found) patterns are returned despite the dataset being optional.

filter(session_datasets, **kwargs)[source]

Filter dataset frame by expected datasets.

Parameters:
  • session_datasets (pandas.DataFrame) – A data frame of session datasets.

  • kwargs – Extra arguments for one.util.filter_datasets, namely revision_last_before, qc, ignore_qc_not_set, and assert_unique.

Returns:

  • True – Always True as dataset is optional.

  • pandas.DataFrame – A filtered data frame containing the expected dataset(s).

class Input(name, collection, register=None, revision=None, unique=True)[source]

Bases: ExpectedDataset

An expected input dataset.

class OptionalInput(name, collection, register=None, revision=None, unique=True)[source]

Bases: Input, OptionalDataset

An optional expected input dataset.

class Output(name, collection, register=None, revision=None, unique=True)[source]

Bases: ExpectedDataset

An expected output dataset.

class OptionalOutput(name, collection, register=None, revision=None, unique=True)[source]

Bases: Output, OptionalDataset

An optional expected output dataset.
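The optional classes combine two bases, so the "always True" behaviour of OptionalDataset reaches OptionalInput and OptionalOutput through Python's method resolution order. The sketch below shows this with toy classes; the class names and the check method are hypothetical and only mirror the shape of the hierarchy listed above.

```python
class ExpectedSketch:
    """Hypothetical stand-in for ExpectedDataset."""

    def __init__(self, name, present):
        self.name = name
        self.present = present  # pretend result of a lookup on disk

    def check(self):
        """Return (ok, missing); ok is False when the dataset is absent."""
        return self.present, (None if self.present else self.name)


class OptionalSketch(ExpectedSketch):
    def check(self):
        """Always report ok, but still pass on what was missing."""
        _, missing = super().check()
        return True, missing


class InputSketch(ExpectedSketch):
    pass


class OptionalInputSketch(InputSketch, OptionalSketch):
    pass


required = InputSketch('wheel.position.npy', present=False).check()
optional = OptionalInputSketch('wheel.position.npy', present=False).check()
mro_names = [cls.__name__ for cls in OptionalInputSketch.__mro__]
```

Because InputSketch does not define check, the lookup falls through to OptionalSketch, so the optional variant reports success while still returning the missing name.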

dataset_from_name(name, datasets)[source]

From a list of ExpectedDataset instances, return those that match a given name.

Parameters:
  • name (str) – The name of the dataset.

  • datasets (list of ExpectedDataset) – A list of ExpectedDataset instances.

Returns:

The ExpectedDataset instances that match the given name.

Return type:

list of ExpectedDataset
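As a rough illustration, the lookup amounts to a filter over the list. The sketch below assumes exact name equality and flat (non-nested) datasets; DatasetSketch and dataset_from_name_sketch are hypothetical names, and the real function may treat nested or combined datasets differently.

```python
from collections import namedtuple

# Hypothetical stand-in for ExpectedDataset with only the fields used here
DatasetSketch = namedtuple('DatasetSketch', ['name', 'collection'])


def dataset_from_name_sketch(name, datasets):
    """Return every dataset whose name equals the requested name."""
    return [d for d in datasets if d.name == name]


datasets = [
    DatasetSketch('spikes.times.npy', 'alf/probe00'),
    DatasetSketch('spikes.times.npy', 'alf/probe01'),
    DatasetSketch('clusters.depths.npy', 'alf/probe00'),
]
matches = dataset_from_name_sketch('spikes.times.npy', datasets)
```

Note that more than one dataset can match, for instance when the same name appears in several collections.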

update_collections(dataset, new_collection, substring=None, unique=None)[source]

Update the collection of a dataset.

This updates all nested ExpectedDataset instances with the new collection and returns copies.

Parameters:
  • dataset (ExpectedDataset) – The dataset to update.

  • new_collection (str, list of str) – The new collection or collections.

  • substring (str, optional) – An optional substring in the collection to replace with new collection(s). If None, the entire collection will be replaced.

Returns:

A copy of the dataset with the updated collection(s).

Return type:

ExpectedDataset
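The effect of the substring argument can be sketched as follows. This is a simplified, hypothetical version that handles a single string collection on a flat dataset; nested datasets, lists of collections and the unique argument are not covered.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DatasetSketch:
    """Hypothetical stand-in for ExpectedDataset."""
    name: str
    collection: str


def update_collection_sketch(dataset, new_collection, substring=None):
    """Return a copy of the dataset with its collection updated."""
    if substring is None:
        # Replace the whole collection
        collection = new_collection
    else:
        # Replace only the matching part of the collection
        collection = dataset.collection.replace(substring, new_collection)
    return replace(dataset, collection=collection)


original = DatasetSketch('spikes.times.npy', 'alf/probe00/pykilosort')
moved = update_collection_sketch(original, 'probe01', substring='probe00')
replaced = update_collection_sketch(original, 'alf/probe01')
```

The input is left untouched in both calls, matching the statement above that copies are returned.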

class DataHandler(session_path, signature, one=None)[source]

Bases: ABC

setUp(**kwargs)[source]

Optionally overload this method to download the data required to run the task.

getData(one=None)[source]

Find the datasets required for the task based on the input signatures.

Parameters:

one (one.api.One, optional) – An instance of ONE to use.

Returns:

A data frame of required datasets. An empty frame is returned if no registered datasets are required, while None is returned if no instance of ONE is set.

Return type:

pandas.DataFrame, None

getOutputFiles()[source]

Return a data frame of output datasets found on disk.

Returns:

A dataset data frame of datasets on disk that were specified in signature[‘output_files’].

Return type:

pandas.DataFrame

uploadData(outputs, version)[source]

Optionally overload this method to upload and register data.

Parameters:
  • outputs – output files from task to register

  • version – ibllib version

Returns:

cleanUp(**kwargs)[source]

Optionally overload this method to clean up files after running the task.

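To show how the three overloadable hooks fit together, here is a minimal sketch of a handler and a driver function. Everything in it is hypothetical: HandlerSketch only mirrors the method names above, the 'input_files' signature key is an assumed counterpart to 'output_files', and the call order in run_with_handler is an assumption for illustration; the actual ordering is determined by ibllib.pipes.tasks.

```python
from abc import ABC


class HandlerSketch(ABC):
    """Hypothetical base class mirroring the DataHandler hook names."""

    def __init__(self, session_path, signature, one=None):
        self.session_path = session_path
        self.signature = signature
        self.one = one

    def setUp(self, **kwargs):
        pass  # default: nothing to download

    def uploadData(self, outputs, version):
        pass  # default: nothing to register

    def cleanUp(self, **kwargs):
        pass  # default: nothing to remove


class RecordingHandler(HandlerSketch):
    """Records each hook call so the order can be inspected."""

    def __init__(self, session_path, signature, one=None):
        super().__init__(session_path, signature, one=one)
        self.calls = []

    def setUp(self, **kwargs):
        self.calls.append('setUp')

    def uploadData(self, outputs, version):
        self.calls.append('uploadData')
        return list(outputs)

    def cleanUp(self, **kwargs):
        self.calls.append('cleanUp')


def run_with_handler(handler, run, version):
    """Assumed call order: fetch inputs, run, register outputs, clean up."""
    handler.setUp()
    try:
        outputs = run()
        return handler.uploadData(outputs, version)
    finally:
        handler.cleanUp()


signature = {'input_files': [], 'output_files': []}
handler = RecordingHandler('/path/to/session', signature)
registered = run_with_handler(handler, lambda: ['alf/_ibl_trials.table.pqt'], '1.0.0')
```

Placing cleanUp in a finally clause is a design choice of this sketch so that temporary files are removed even if the task fails.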
class LocalDataHandler(session_path, signatures, one=None)[source]

Bases: DataHandler

class ServerDataHandler(session_path, signatures, one=None)[source]

Bases: DataHandler

uploadData(outputs, version, clobber=False, **kwargs)[source]

Upload and/or register output data.

This is typically called by ibllib.pipes.tasks.Task.register_datasets().

Parameters:
  • outputs (list of pathlib.Path) – A set of ALF paths to register to Alyx.

  • version (str, list of str) – The version of ibllib used to generate these output files.

  • clobber (bool) – If True, re-upload outputs that have already been passed to this method.

  • kwargs – Optional keyword arguments for one.registration.RegistrationClient.register_files.

Returns:

A list of newly created Alyx dataset records or the registration data if dry.

Return type:

list of dicts, dict

cleanUp(**_)[source]

Empties and returns the processed dataset map.
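The interaction between clobber and the processed dataset map can be pictured with a small sketch. It assumes the handler keeps a dictionary of outputs it has already handled; the class, attribute and method names are hypothetical, and no Alyx registration takes place.

```python
class ProcessedMapSketch:
    """Hypothetical handler that remembers which outputs it has seen."""

    def __init__(self):
        self.processed = {}  # maps output path -> fake registration record

    def upload(self, outputs, version, clobber=False):
        # Skip outputs already passed to this method unless clobber is set
        to_upload = [f for f in outputs if clobber or f not in self.processed]
        records = [{'path': f, 'version': version} for f in to_upload]
        self.processed.update(zip(to_upload, records))
        return records

    def clean_up(self):
        # Empty the map and hand back its previous contents
        processed, self.processed = self.processed, {}
        return processed


sketch = ProcessedMapSketch()
first = sketch.upload(['a.npy', 'b.npy'], '1.0.0')
second = sketch.upload(['a.npy', 'c.npy'], '1.0.0')
forced = sketch.upload(['a.npy'], '1.0.0', clobber=True)
drained = sketch.clean_up()
```

In this sketch the second call only handles 'c.npy', whereas the forced call handles 'a.npy' again.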

class ServerGlobusDataHandler(session_path, signatures, one=None)[source]

Bases: DataHandler

setUp(**_)[source]

Function to download necessary data to run tasks using globus-sdk.

uploadData(outputs, version, **kwargs)[source]

Function to upload and register the data of a completed task.

Parameters:
  • outputs – output files from task to register

  • version – ibllib version

Returns:

output info of registered datasets

cleanUp(**_)[source]

Clean up by removing the files that were downloaded from Globus once the task has completed.

class RemoteEC2DataHandler(session_path, signature, one=None)[source]

Bases: DataHandler

setUp(**_)[source]

Function to download necessary data to run tasks using ONE.

uploadData(outputs, version, **kwargs)[source]

Function to upload and register the data of a completed task via the S3 patcher.

Parameters:
  • outputs – output files from task to register

  • version – ibllib version

Returns:

output info of registered datasets

class RemoteHttpDataHandler(session_path, signature, one=None)[source]

Bases: DataHandler

setUp(**_)[source]

Function to download necessary data to run tasks using ONE.

uploadData(outputs, version, **kwargs)[source]

Function to upload and register the data of a completed task via the FTP patcher.

Parameters:
  • outputs – output files from task to register

  • version – ibllib version

Returns:

output info of registered datasets

class RemoteAwsDataHandler(session_path, signature, one=None)[source]

Bases: DataHandler

setUp(**_)[source]

Function to download necessary data to run tasks using AWS boto3.

uploadData(outputs, version, **kwargs)[source]

Function to upload and register the data of a completed task via the FTP patcher.

Parameters:
  • outputs – output files from task to register

  • version – ibllib version

Returns:

output info of registered datasets

cleanUp(task)[source]

Clean up by removing the files that were downloaded from Globus once the task has completed.

class RemoteGlobusDataHandler(session_path, signature, one=None)[source]

Bases: DataHandler

Data handler for running tasks on a remote compute node. Missing data are downloaded using Globus.

Parameters:
  • session_path – path to session

  • signature – input and output file signatures

  • one – ONE instance

setUp(**_)[source]

Function to download necessary data to run tasks using Globus.

uploadData(outputs, version, **kwargs)[source]

Function to upload and register the data of a completed task via the FTP patcher.

Parameters:
  • outputs – output files from task to register

  • version – ibllib version

Returns:

output info of registered datasets

class SDSCDataHandler(session_path, signatures, one=None)[source]

Bases: DataHandler

Data handler for running tasks on an SDSC compute node.

Parameters:
  • session_path – path to session

  • signatures – input and output file signatures

  • one – ONE instance

setUp(task)[source]

Function to create symlinks to necessary data to run tasks.

uploadData(outputs, version, **kwargs)[source]

Function to upload and register the data of a completed task via the SDSC patcher.

Parameters:
  • outputs – output files from task to register

  • version – ibllib version

Returns:

output info of registered datasets

cleanUp(task)[source]

Function to clean up symlinks created to run task.
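The symlink approach avoids copying large input files: links are created for the run and removed afterwards while the source data stay in place. The sketch below demonstrates the idea with the standard library on a POSIX-like file system; link_inputs and remove_links are hypothetical helpers, not part of ibllib.

```python
import tempfile
from pathlib import Path


def link_inputs(source_files, link_dir):
    """Create one symlink per source file inside link_dir."""
    links = []
    for src in source_files:
        link = Path(link_dir) / Path(src).name
        link.symlink_to(src)
        links.append(link)
    return links


def remove_links(links):
    """Remove only the symlinks, leaving the source files untouched."""
    for link in links:
        if link.is_symlink():
            link.unlink()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data, work = root / 'data', root / 'work'
    data.mkdir()
    work.mkdir()
    src = data / 'spikes.times.npy'
    src.write_bytes(b'')

    links = link_inputs([src], work)
    linked_ok = links[0].is_symlink() and links[0].resolve() == src.resolve()

    remove_links(links)
    source_kept = src.exists()
    link_removed = not links[0].exists() and not links[0].is_symlink()
```

Checking is_symlink before unlinking guards against accidentally deleting a real file that happens to share the link's path.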

class PopeyeDataHandler(session_path, signatures, one=None)[source]

Bases: SDSCDataHandler

uploadData(outputs, version, **kwargs)[source]

Function to upload and register the data of a completed task via the SDSC patcher.

Parameters:
  • outputs – output files from task to register

  • version – ibllib version

Returns:

output info of registered datasets

cleanUp(**_)[source]

Symlinks are preserved until registration.