{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading with ONE\n", "The datasets are organized into directory trees by subject, date and session number. For a\n", "given session there are data files grouped by object (e.g. 'trials'), each with a specific\n", "attribute (e.g. 'rewardVolume'). The dataset name follows the pattern 'object.attribute',\n", "for example 'trials.rewardVolume'. For more information, see the [ALF documentation](../alyx_files.html).\n", "\n", "An [experiment ID](../experiment_ids.html) (eid) is a string that uniquely identifies a session,\n", "for example a combinationof subject date and number (e.g. KS023/2019-12-10/001), a file path (e.g.\n", "C:\\Users\\Subjects\\KS023\\2019-12-10\\001), or a UUID (e.g. aad23144-0e52-4eac-80c5-c4ee2decb198).\n", "\n", "If the data don't exist locally, they will be downloaded, then loaded." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "execution": { "iopub.execute_input": "2021-09-07T19:17:08.394242Z", "iopub.status.busy": "2021-09-07T19:17:08.393245Z", "iopub.status.idle": "2021-09-07T19:17:08.632883Z", "shell.execute_reply": "2021-09-07T19:17:08.630864Z" } }, "outputs": [], "source": [ "from pprint import pprint\n", "from one.api import ONE\n", "import one.alf.io as alfio\n", "\n", "one = ONE(base_url='https://openalyx.internationalbrainlab.org', silent=True)\n", "\n", "# To load all the data for a given object, use the load_object method:\n", "eid = 'KS023/2019-12-10/001' # subject/date/number\n", "trials = one.load_object(eid, 'trials') # Returns a dict-like object of numpy arrays" ] }, { "cell_type": "markdown", "source": [ "The attributes of returned object mirror the datasets:" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 37, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dict_keys(['contrastLeft', 'intervals', 'response_times', 'stimOff_times', 'goCueTrigger_times', 'itiDuration', 'goCue_times', 'contrastRight', 'intervals_bpod', 'feedbackType', 'stimOn_times', 'choice', 'firstMovement_times', 'rewardVolume', 'feedback_times', 'probabilityLeft'])\n", "[1.5 1.5 1.5 0. 1.5]\n", "[1.5 1.5 1.5 0. 1.5]\n" ] } ], "source": [ "print(trials.keys())\n", "# The data can be accessed with dot syntax\n", "print(trials.rewardVolume[:5])\n", "# ... 
or dictionary syntax\n", "print(trials['rewardVolume'][:5])" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "All arrays in the object have the same length (the size of the first dimension) and can\n", "therefore be converted to a DataFrame:" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 38, "outputs": [], "source": [ "trials.to_df().head()\n", "\n", "# For analysis you can assert that the dimensions match using the check_dimensions property:\n", "assert trials.check_dimensions == 0" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "If we only want to load in certain attributes of an object we can use the following" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 39, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dict_keys(['intervals', 'intervals_bpod', 'rewardVolume', 'probabilityLeft'])\n" ] } ], "source": [ "trials = one.load_object(eid, 'trials', attribute=['intervals', 'rewardVolume', 'probabilityLeft'])\n", "print(trials.keys())" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Datasets can be individually downloaded using the `load_dataset` method. This\n", "function takes an experiment ID and a dataset name as positional args." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 40, "outputs": [], "source": [ "reward_volume = one.load_dataset(eid, '_ibl_trials.rewardVolume.npy') # c.f. load_object, above" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "We can use the load_datasets method to load multiple datasets at once. This method returns two\n", "lists, the first which contains the data for each dataset and the second which contains meta\n", "information about the data.\n", "\n", "
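"Once loaded, each element of the data list pairs with the corresponding element of the info\n", "list. A minimal sketch, using the `data` and `info` variables returned in the cell below:\n", "\n", "```python\n", "for d, meta in zip(data, info):\n", "    print(meta['rel_path'], d.shape)  # relative path and array shape of each dataset\n", "```\n", "\n",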
\n", "Note.\n", "\n", "When the `assert_present` flag can be set to false, if a given dataset doesn't exist a None is\n", "returned instead of raising an exception.\n", "
" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 41, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'exists': True,\n", " 'file_size': 5256.0,\n", " 'hash': '819ae9cc4643cc7ed6cf8453e6cec339',\n", " 'id_0': 8593347991464373244,\n", " 'id_1': -3444378546711777370,\n", " 'rel_path': 'alf/_ibl_trials.rewardVolume.npy',\n", " 'revision': '',\n", " 'session_path': 'public/cortexlab/Subjects/KS023/2019-12-10/001'}\n" ] } ], "source": [ "data, info = one.load_datasets(eid, datasets=['_ibl_trials.rewardVolume.npy',\n", " '_ibl_trials.probabilityLeft.npy'])\n", "pprint(info[0])" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Collections\n", "\n", "For any given session there may be multiple datasets with the same name that are organized into\n", "separate subfolders called collections. For example there may be spike times for two probes, one\n", "in 'alf/probe00/spikes.times.npy', the other in 'alf/probe01/spikes.times.npy'. In IBL, the 'alf'\n", "directory (for ALyx Files) contains the main datasets that people use. Raw data is in other\n", "directories.\n", "\n", "In this case you must specify the collection when multiple matching datasets are found:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 42, "outputs": [], "source": [ "probe1_spikes = one.load_dataset(eid, 'spikes.times.npy', collection='alf/probe01')" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "It is also possible to load datasets from different collections at the same time. For example if\n", " we want to simultaneously load a trials dataset and a clusters dataset we would type," ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 43, "outputs": [], "source": [ "data, info = one.load_datasets(eid, datasets=['_ibl_trials.rewardVolume.npy', 'clusters.waveforms.npy'],\n", " collections=['alf', 'alf/probe01'])" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Revisions\n", "\n", "Revisions provide an optional way to organize data by version. The version label is\n", "arbitrary, however the folder must start and end with pound signs and is typically an ISO date,\n", "e.g. \"#2021-01-01#\". Unlike collections, if a specified revision is not found, the previous\n", "revision will be returned. The revisions are ordered lexicographically.\n", "\n", "```python\n", "probe1_spikes = one.load_dataset(eid, 'trials.intervals.npy', revision='2021-03-15a')\n", "```\n", "\n", "## Download only\n", "\n", "By default the load methods will download any missing data, then load and return the data.\n", "When the 'download_only' kwarg is true, the data are not loaded. Instead a list of file paths\n", "are returned, and any missing datasets are represented by None." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 44, "outputs": [], "source": [ "files = one.load_object(eid, 'trials', download_only=True)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "You can load objects and datasets from a file path" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 45, "outputs": [], "source": [ "trials = one.load_object(files[0], 'trials')\n", "contrast_left = one.load_dataset(files[0], files[0].name)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Advanced loading\n", "\n", "The load methods typically require an exact match, therefore when loading '\\_ibl_wheel.position\n", ".npy'\n", "`one.load_dataset(eid, 'wheel.position.npy')` will raise an exception because the namespace is\n", "missing. Likewise `one.load_object(eid, 'trial')` will fail because 'trial' != 'trials'.\n", "\n", "Loading can be done using unix shell style wildcards, allowing you to load objects and datasets\n", "that match a particular pattern, e.g. `one.load_dataset(eid, '*wheel.position.npy')`.\n", "\n", "By default wildcard mode is on. In this mode, the extension may be omitted, e.g.\n", "`one.load_dataset(eid, 'spikes.times')`. This is equivalent to 'spikes.times.\\*'. Note that an\n", "exception will be raised if datasets with more than one extension are found (such as\n", "'spikes.times.npy' and 'spikes.times.csv'). When loading a dataset with extra parts,\n", "the extension (or wildcard) is explicitly required: 'spikes.times.part1.*'.\n", "\n", "If you set the wildcards property of One to False, loading will be done using regular expressions,\n", "allowing for more powerful pattern matching.\n", "\n", "Below is table showing how to express unix style wildcards as a regular expression:\n", "\n", "| Regex | Wildcard | Description | Example |\n", "|-------|----------|--------------------------|------------------------|\n", "| .* | * | Match zero or more chars | spikes.times.* |\n", "| .? | ? | Match one char | timestamps.?sv |\n", "| [] | [] | Match a range of chars | obj.attr.part[0-9].npy |\n", "\n", "NB: In regex '.' means 'any character'; to match '.' exactly, escape it with a backslash\n", "\n", "Examples:\n", " spikes.times.* (regex), spikes.times* (wildcard) matches...\n", "\n", " spikes.times.npy\n", " spikes.times\n", " spikes.times_ephysClock.npy\n", " spikes.times.bin\n", "\n", " clusters.uuids..?sv (regex), clusters.uuids.?sv (wildcard) matches...\n", "\n", " clusters.uuids.ssv\n", " clusters.uuids.csv\n", "\n", " alf/probe0[0-5] (regex), alf/probe0[0-5] (wildcard) matches...\n", "\n", " alf/probe00\n", " alf/probe01\n", " [...]\n", " alf/probe05\n", "\n", "\n", "### Filtering attributes\n", "To download and load only a subset of attributes, you can provide a list to the attribute kwarg." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 46, "outputs": [], "source": [ "spikes = one.load_object(eid, 'spikes', collection='alf/probe01', attribute=['time*', 'clusters'])\n", "assert 'amps' not in spikes" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "### Loading with file name parts\n", "You may also specify specific parts of the filename for even more specific filtering. 
Here a\n", "list of options will be treated as a logical OR\n", "\n", "
\n", "Note.\n", "\n", "All fields accept wildcards.\n", "
" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 47, "outputs": [], "source": [ "dataset = dict(object='spikes', attribute='times', extension=['npy', 'bin'])\n", "probe1_spikes = one.load_dataset(eid, dataset, collection='alf/probe01')" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "### More regex examples\n", "```python\n", "one.wildcards = False\n", "```\n", "\n", "Load specific attributes from an object ('|' represents a logical OR in regex)\n", "```python\n", "spikes = one.load_object(eid, 'spikes', collection='alf/probe01', attribute='times|clusters')\n", "assert 'amps' not in spikes\n", "```\n", "\n", "Load a dataset ignoring any namespace or extension:\n", "```python\n", "spike_times = one.load_dataset(eid, '.*spikes.times.*', collection='alf/probe01')\n", "```\n", "\n", "List all datasets in any probe collection (matches 0 or more of any number)\n", "```python\n", "dsets = one.list_datasets(eid, collection='alf/probe[0-9]*')\n", "```\n", "\n", "Load object attributes that are not delimited text files (i.e. tsv, ssv, csv, etc.)\n", "```python\n", "files = one.load_object(eid, 'clusters', extension='[^sv]*', download_only=True)\n", "assert not any(str(x).endswith('csv') for x in files)\n", "```\n", "\n", "### Load spike times from a probe UUID\n", "```python\n", "pid = 'b749446c-18e3-4987-820a-50649ab0f826'\n", "session, probe = one.pid2eid(pid)\n", "spikes_times = one.load_dataset(session, 'spikes.times.npy', collection=f'alf/{probe}')\n", "```\n", "\n", "List all probes for a session\n", "```python\n", "print([x for x in one.list_collections(session) if 'alf/probe' in x])\n", "```" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "markdown", "source": [ "### Loading with relative paths\n", "You may also the complete dataset path, relative to the session path. When doing this the path must\n", "be complete (i.e. without wildcards) and the collection and revision arguments must be None.\n", "\n", "
\n", "Note.\n", "\n", "To ensure you're loading the default revision (usually the most recent and correct data), do not\n", "explicitly provide the relative path or revision, and ONE will return the default automatically.\n", "
\n", "\n", "```python\n", "spikes_times = one.load_dataset(eid, 'alf/probe00/spikes.times.npy')\n", "```\n", "\n", "Download all the raw data for a given session*\n", "```python\n", "dsets = one.list_datasets(eid, collection='raw_*_data')\n", "one.load_datasets(eid, dsets, download_only=True)\n", "```\n", "*NB: This will download all revisions of the same data; for this reason it is better to objects and\n", "collections individually, or to provide dataset names instead of relative paths." ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "## Loading with timeseries\n", "For loading a dataset along with its timestamps, alf.io.read_ts can be used. It requires a\n", "filepath as input." ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 48, "outputs": [], "source": [ "files = one.load_object(eid, 'spikes', collection='alf/probe01', download_only=True)\n", "ts, clusters = alfio.read_ts(files[1])" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Loading collections\n", "You can load whole collections with the `load_collection` method. For example to load the\n", "spikes and clusters objects for probe01:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "probe01 = one.load_collection(eid, '*probe01', object=['spikes', 'clusters'])\n", "probe01.spikes.times[:5]" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "The download_only flag here provides a simple way to download all datasets within a collection:\n", "```\n", "one.load_collection(eid, 'alf/probe01', download_only=True)\n", "```" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "markdown", "source": [ "More information about these methods can be found using the help command" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 49, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on method load_dataset in module one.api:\n", "\n", "load_dataset(eid: Union[str, pathlib.Path, uuid.UUID], dataset: str, collection: Union[str, NoneType] = None, revision: Union[str, NoneType] = None, query_type: Union[str, NoneType] = None, download_only: bool = False, **kwargs) -> Any method of one.api.OneAlyx instance\n", " Load a single dataset for a given session id and dataset name\n", " \n", " Parameters\n", " ----------\n", " eid : str, UUID, pathlib.Path, dict\n", " Experiment session identifier; may be a UUID, URL, experiment reference string\n", " details dict or Path.\n", " dataset : str, dict\n", " The ALF dataset to load. May be a string or dict of ALF parts. Supports asterisks as\n", " wildcards.\n", " collection : str\n", " The collection to which the object belongs, e.g. 'alf/probe01'.\n", " This is the relative path of the file from the session root.\n", " Supports asterisks as wildcards.\n", " revision : str\n", " The dataset revision (typically an ISO date). If no exact match, the previous\n", " revision (ordered lexicographically) is returned. If None, the default revision is\n", " returned (usually the most recent revision). 
Regular expressions/wildcards not\n", " permitted.\n", " query_type : str\n", " Query cache ('local') or Alyx database ('remote')\n", " download_only : bool\n", " When true the data are downloaded and the file path is returned.\n", " \n", " Returns\n", " -------\n", " Dataset or a Path object if download_only is true.\n", " \n", " Examples\n", " --------\n", " intervals = one.load_dataset(eid, '_ibl_trials.intervals.npy')\n", " # Load dataset without specifying extension\n", " intervals = one.load_dataset(eid, 'trials.intervals') # wildcard mode only\n", " intervals = one.load_dataset(eid, '*trials.intervals*') # wildcard mode only\n", " filepath = one.load_dataset(eid '_ibl_trials.intervals.npy', download_only=True)\n", " spike_times = one.load_dataset(eid 'spikes.times.npy', collection='alf/probe01')\n", " old_spikes = one.load_dataset(eid, 'spikes.times.npy',\n", " collection='alf/probe01', revision='2020-08-31')\n", "\n" ] } ], "source": [ "help(one.load_dataset)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Loading aggregate datasets\n", "All raw and preprocessed data are stored at the session level, however some datasets are aggregated\n", "over a subject, project, or tag (called a 'relation'). Such datasets can be loaded using the `load_aggregate`\n", "method.\n", "\n", "
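"Aggregate files can also simply be downloaded. A minimal sketch (assuming `load_aggregate`\n", "accepts the same `download_only` flag as the session-level load methods):\n", "\n", "```python\n", "fp = one.load_aggregate('subjects', 'SWC_043', '_ibl_subjectTrials.table',\n", "                        download_only=True)  # returns a file path\n", "```\n", "\n",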
\n", "Note.\n", "\n", "NB: This method is only available in 'remote' mode.\n", "
" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "subject = 'SWC_043'\n", "subject_trials = one.load_aggregate('subjects', subject, '_ibl_subjectTrials.table')" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } } ], "metadata": { "docs_executed": "errored", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 0 }