10  Example file format: NWB files

Open in Google Colab | Download notebook

Data set download


# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot pynwb bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
import polars as pl
import numpy as np

import h5py
import pynwb

import iqplot

import bebi103

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

In this example, we will open up an NWB (Neurodata Without Borders) file containing electrophysiological recordings of neurons in the human brain. NWB was created as a standard for neuroscience data so that they may be shared across research groups. This is stated clearly in their description of the data standard:

Neurodata Without Borders (NWB) is a data standard for neurophysiology, providing neuroscientists with a common standard to share, archive, use, and build common analysis tools for neurophysiology data. NWB is designed to store a variety of neurophysiology data, including data from intracellular and extracellular electrophysiology experiments, data from optical physiology experiments, and tracking and stimulus data. The project includes not only the NWB format, but also a broad range of software for data standardization and application programming interfaces (APIs) for reading and writing the data as well as high-value data sets that have been translated into the NWB data standard.

NWB files are stored in HDF5 format with a specific schema for their structure (with a more detailed schema here). Therefore, we can use an HDF5 parser to explore NWB data sets.

HDF stands for “hierarchical data format,” meaning that a single HDF file can contain a file system-like structure of data. Each entry is either a dataset, which contains an array of data, or a group, which contains datasets and other groups. You can think of groups like directories in a file system. Both datasets and groups can have arbitrary attributes, which are (possibly formless) metadata. Importantly, HDF5 files are highly optimized for large data sets. When working with them, you do not need to store the data in RAM, but can rapidly access portions of the data from disk as needed.
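To make this concrete, here is a minimal sketch that builds a small scratch HDF5 file (the file name is made up for illustration) containing a group, a dataset, and an attribute:

```python
import h5py
import numpy as np

# Create a scratch HDF5 file (hypothetical file name) with one group,
# one dataset inside that group, and a metadata attribute on the dataset
with h5py.File("scratch_example.h5", "w") as fdemo:
    grp = fdemo.create_group("measurements")
    dset = grp.create_dataset("voltages", data=np.arange(10.0))
    dset.attrs["units"] = "mV"

# Reopen and inspect; indexing with a slice reads only that portion
with h5py.File("scratch_example.h5", "r") as fdemo:
    dset = fdemo["measurements/voltages"]
    print(dset.shape)           # (10,)
    print(dset.attrs["units"])  # mV
    print(dset[:3])             # only the first three entries are read
```

The key design point: the data live on disk, and the `File` object is a view into them, so even for a multi-gigabyte dataset, `dset[:3]` reads only what is requested.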

To see how this works, let’s consider an example. We will work with the file sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb, available here. It may also be directly downloaded from the DANDI Archive by clicking here. The data set comes from the lab of Ueli Rutishauser. It was featured in a publication describing NWB by Chandravadia, et al., and originally appeared in a paper by Faraut, et al. on declarative memory encoding and recognition in humans. The data set includes recordings of single neurons from the human amygdala and hippocampus. The subjects were shown a set of images. They were then shown another set of images, some of which they had already seen. Neuronal spike times were recorded during this process. We will look at a portion of this data set measured from electrodes in the right amygdala and left hippocampus of a single subject, a 31-year-old male. The entire data set is freely available from the DANDI Archive.

10.1 Parsing the file with h5py

The h5py package enables opening and navigating HDF5 files. It takes advantage of HDF5’s structure to enable access to the data on disk without copying everything into RAM. We have imported h5py above and can use it to open the file.

with h5py.File(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
) as f:
    print(f)
<HDF5 file "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb" (mode r)>

Note that we have used h5py.File() to instantiate a File instance that gives us access to the data in the file. I am using context management here to keep me out of trouble, ensuring the file gets closed even if I forget or an error occurs while reading or writing.

After opening the file, I printed it, which is an HDF5 File instance. These behave much like dictionaries, and I can see the keys of the dictionary using the keys() method.

with h5py.File(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
) as f:
    print(f.keys())
<KeysViewHDF5 ['acquisition', 'analysis', 'file_create_date', 'general', 'identifier', 'intervals', 'processing', 'session_description', 'session_start_time', 'stimulus', 'timestamps_reference_time', 'units']>

We see that the HDF file contains groups acquisition, analysis, etc. It would be useful to crawl the whole hierarchical structure of the file, something like ls -R on the command line. This is possible using the visititems() method. It takes a single argument, which is a function you want applied to each item in the hierarchy of the file. To view what we have, the print() function will work fine.

with h5py.File(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
) as f:
    f.visititems(print)
acquisition <HDF5 group "/acquisition" (2 members)>
acquisition/events <HDF5 group "/acquisition/events" (2 members)>
acquisition/events/data <HDF5 dataset "data": shape (754,), type "|O">
acquisition/events/timestamps <HDF5 dataset "timestamps": shape (754,), type "<f8">
acquisition/experiment_ids <HDF5 group "/acquisition/experiment_ids" (2 members)>
acquisition/experiment_ids/data <HDF5 dataset "data": shape (754,), type "<f8">
acquisition/experiment_ids/timestamps <HDF5 dataset "timestamps": shape (754,), type "<f8">
analysis <HDF5 group "/analysis" (0 members)>
file_create_date <HDF5 dataset "file_create_date": shape (1,), type "|O">
general <HDF5 group "/general" (9 members)>
general/data_collection <HDF5 dataset "data_collection": shape (), type "|O">
general/devices <HDF5 group "/general/devices" (1 members)>
general/devices/Neuralynx-cheetah <HDF5 group "/general/devices/Neuralynx-cheetah" (0 members)>
general/experiment_description <HDF5 dataset "experiment_description": shape (), type "|O">
general/extracellular_ephys <HDF5 group "/general/extracellular_ephys" (11 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-1 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-1" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-10 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-10" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-11 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-11" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-13 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-13" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-15 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-15" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-3 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-3" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-4 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-4" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-5 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-5" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-6 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-6" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-7 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-7" (1 members)>
general/extracellular_ephys/electrodes <HDF5 group "/general/extracellular_ephys/electrodes" (10 members)>
general/extracellular_ephys/electrodes/filtering <HDF5 dataset "filtering": shape (10,), type "|O">
general/extracellular_ephys/electrodes/group <HDF5 dataset "group": shape (10,), type "|O">
general/extracellular_ephys/electrodes/group_name <HDF5 dataset "group_name": shape (10,), type "|O">
general/extracellular_ephys/electrodes/id <HDF5 dataset "id": shape (10,), type "<i4">
general/extracellular_ephys/electrodes/imp <HDF5 dataset "imp": shape (10,), type "<f8">
general/extracellular_ephys/electrodes/location <HDF5 dataset "location": shape (10,), type "|O">
general/extracellular_ephys/electrodes/origChannel <HDF5 dataset "origChannel": shape (10,), type "<u2">
general/extracellular_ephys/electrodes/x <HDF5 dataset "x": shape (10,), type "<f8">
general/extracellular_ephys/electrodes/y <HDF5 dataset "y": shape (10,), type "<f8">
general/extracellular_ephys/electrodes/z <HDF5 dataset "z": shape (10,), type "<f8">
general/institution <HDF5 dataset "institution": shape (), type "|O">
general/keywords <HDF5 dataset "keywords": shape (7,), type "|O">
general/lab <HDF5 dataset "lab": shape (), type "|O">
general/related_publications <HDF5 dataset "related_publications": shape (1,), type "|O">
general/subject <HDF5 group "/general/subject" (5 members)>
general/subject/age <HDF5 dataset "age": shape (), type "|O">
general/subject/description <HDF5 dataset "description": shape (), type "|O">
general/subject/sex <HDF5 dataset "sex": shape (), type "|O">
general/subject/species <HDF5 dataset "species": shape (), type "|O">
general/subject/subject_id <HDF5 dataset "subject_id": shape (), type "|O">
identifier <HDF5 dataset "identifier": shape (), type "|O">
intervals <HDF5 group "/intervals" (1 members)>
intervals/trials <HDF5 group "/intervals/trials" (14 members)>
intervals/trials/category_name <HDF5 dataset "category_name": shape (150,), type "|O">
intervals/trials/delay1_time <HDF5 dataset "delay1_time": shape (150,), type "<f8">
intervals/trials/delay2_time <HDF5 dataset "delay2_time": shape (150,), type "<f8">
intervals/trials/external_image_file <HDF5 dataset "external_image_file": shape (150,), type "|O">
intervals/trials/id <HDF5 dataset "id": shape (150,), type "<i4">
intervals/trials/new_old_labels_recog <HDF5 dataset "new_old_labels_recog": shape (150,), type "|O">
intervals/trials/response_time <HDF5 dataset "response_time": shape (150,), type "<f8">
intervals/trials/response_value <HDF5 dataset "response_value": shape (150,), type "<f8">
intervals/trials/start_time <HDF5 dataset "start_time": shape (150,), type "<f8">
intervals/trials/stimCategory <HDF5 dataset "stimCategory": shape (150,), type "|u1">
intervals/trials/stim_off_time <HDF5 dataset "stim_off_time": shape (150,), type "<f8">
intervals/trials/stim_on_time <HDF5 dataset "stim_on_time": shape (150,), type "<f8">
intervals/trials/stim_phase <HDF5 dataset "stim_phase": shape (150,), type "|O">
intervals/trials/stop_time <HDF5 dataset "stop_time": shape (150,), type "<f8">
processing <HDF5 group "/processing" (0 members)>
session_description <HDF5 dataset "session_description": shape (), type "|O">
session_start_time <HDF5 dataset "session_start_time": shape (), type "|O">
stimulus <HDF5 group "/stimulus" (2 members)>
stimulus/presentation <HDF5 group "/stimulus/presentation" (1 members)>
stimulus/presentation/StimulusPresentation <HDF5 group "/stimulus/presentation/StimulusPresentation" (7 members)>
stimulus/presentation/StimulusPresentation/data <HDF5 dataset "data": shape (150, 400, 300, 3), type "|u1">
stimulus/presentation/StimulusPresentation/dimension <HDF5 dataset "dimension": shape (3,), type "<i4">
stimulus/presentation/StimulusPresentation/distance <HDF5 dataset "distance": shape (), type "<f8">
stimulus/presentation/StimulusPresentation/field_of_view <HDF5 dataset "field_of_view": shape (3,), type "<f8">
stimulus/presentation/StimulusPresentation/format <HDF5 dataset "format": shape (), type "|O">
stimulus/presentation/StimulusPresentation/orientation <HDF5 dataset "orientation": shape (), type "|O">
stimulus/presentation/StimulusPresentation/timestamps <HDF5 dataset "timestamps": shape (150,), type "<f8">
stimulus/templates <HDF5 group "/stimulus/templates" (0 members)>
timestamps_reference_time <HDF5 dataset "timestamps_reference_time": shape (), type "|O">
units <HDF5 group "/units" (11 members)>
units/IsolationDist <HDF5 dataset "IsolationDist": shape (33,), type "<f8">
units/SNR <HDF5 dataset "SNR": shape (33,), type "<f8">
units/electrodes <HDF5 dataset "electrodes": shape (33,), type "<i4">
units/electrodes_index <HDF5 dataset "electrodes_index": shape (33,), type "<i4">
units/id <HDF5 dataset "id": shape (33,), type "<i4">
units/origClusterID <HDF5 dataset "origClusterID": shape (33,), type "<i4">
units/spike_times <HDF5 dataset "spike_times": shape (118870,), type "<f8">
units/spike_times_index <HDF5 dataset "spike_times_index": shape (33,), type "<i4">
units/waveform_mean_encoding <HDF5 dataset "waveform_mean_encoding": shape (33, 256), type "<f8">
units/waveform_mean_recognition <HDF5 dataset "waveform_mean_recognition": shape (33, 256), type "<f8">
units/waveform_mean_sampling_rate <HDF5 dataset "waveform_mean_sampling_rate": shape (33, 1), type "<f8">
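The visitor need not be print(); visititems() will call any function taking a (name, object) pair. As a sketch (again using a small scratch file with a made-up name), here is a visitor that collects the paths of all datasets while skipping groups:

```python
import h5py
import numpy as np

# Build a scratch file (hypothetical name); intermediate groups are
# created automatically when a path like "a/x" is given
with h5py.File("visit_demo.h5", "w") as fdemo:
    fdemo.create_dataset("a/x", data=np.zeros(3))
    fdemo.create_dataset("a/y", data=np.zeros(4))
    fdemo.create_dataset("b", data=np.zeros(5))

# Visitor that records dataset paths and ignores groups
dataset_paths = []

def collect(name, obj):
    if isinstance(obj, h5py.Dataset):
        dataset_paths.append(name)

with h5py.File("visit_demo.h5", "r") as fdemo:
    fdemo.visititems(collect)

print(dataset_paths)  # ['a/x', 'a/y', 'b']
```

This pattern is handy for programmatically locating datasets in a deeply nested file like the NWB file above.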

To access a group or dataset from the File instance, we index like a dictionary with the “path” to the object of interest. So, for example, say we wanted to find the time points when stimuli were presented to the subject; we can index as follows. (Going forward, we will not use context management, just to avoid code clutter, and will close the File instance when we are done. Generally, though, you should use context management when doing file I/O.)

f = h5py.File(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
)

f["intervals/trials/stim_on_time"]
<HDF5 dataset "stim_on_time": shape (150,), type "<f8">

Note that we see this as a dataset consisting of 150 64-bit floating point numbers (f8). The data have not yet been read into memory, but the dataset is available for use almost exactly like a NumPy array. Importantly, datasets support much of NumPy’s fancy indexing and slicing. If we do want direct access to the entire array as a NumPy array, we index with an empty tuple [()] or [tuple()].

f['intervals/trials/stim_on_time'][()]
array([3018.31409 , 3023.236694, 3027.561148, 3031.890272, 3036.014586,
       3039.860668, 3043.831013, 3048.447768, 3053.086304, 3058.577719,
       3062.238007, 3067.642116, 3072.621551, 3077.625471, 3082.789226,
       3088.027263, 3092.089399, 3096.196786, 3100.801929, 3105.771749,
       3110.665814, 3114.849509, 3118.918742, 3123.272534, 3128.122392,
       3132.590985, 3137.111894, 3141.603956, 3147.717728, 3153.470202,
       3158.333761, 3168.646004, 3175.345146, 3179.651875, 3184.451598,
       3188.370026, 3192.887064, 3197.331327, 3201.374939, 3205.493293,
       3209.305061, 3213.320073, 3217.695492, 3222.133979, 3226.03333 ,
       3230.202833, 3234.002927, 3237.959018, 3241.884543, 3245.480595,
       3956.716701, 3963.177978, 3969.633632, 3974.25094 , 3978.264324,
       3984.13252 , 3989.92923 , 3995.180906, 4002.849509, 4008.402487,
       4012.75376 , 4017.317062, 4022.795421, 4031.293864, 4037.318977,
       4043.365779, 4047.332315, 4054.664626, 4061.639448, 4067.666958,
       4072.004561, 4084.804356, 4089.888424, 4093.922298, 4097.611186,
       4101.601407, 4116.682899, 4120.547506, 4124.680114, 4130.163327,
       4137.020737, 4144.22169 , 4148.946118, 4153.538451, 4157.395439,
       4161.793468, 4166.075529, 4175.888019, 4179.973042, 4184.478775,
       4188.875299, 4194.887817, 4198.663827, 4203.638532, 4207.17382 ,
       4214.310752, 4217.899585, 4221.665581, 4225.69116 , 4234.420955,
       4238.493628, 4242.376759, 4248.550496, 4253.654993, 4257.915795,
       4261.888014, 4267.130351, 4271.468937, 4277.210474, 4281.230217,
       4287.272472, 4291.701405, 4295.630678, 4299.617304, 4304.125188,
       4308.165789, 4312.056017, 4317.048938, 4321.668581, 4326.359709,
       4332.466845, 4339.217012, 4344.433944, 4348.149283, 4352.545253,
       4357.203112, 4361.00293 , 4365.944918, 4371.058385, 4375.244998,
       4378.87601 , 4384.071039, 4387.848954, 4392.15089 , 4396.037738,
       4401.323729, 4404.87579 , 4412.951187, 4417.785962, 4422.133795,
       4429.22354 , 4434.198583, 4440.647387, 4445.09426 , 4449.167026,
       4453.384943, 4458.401212, 4464.093628, 4472.946794, 4478.351456])
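Reading the full array is fine for 150 numbers, but for large datasets we would rather pull only what we need. As a sketch with a scratch file (made-up name; datasets in the NWB file behave the same way), slicing and fancy indexing read only the requested elements from disk:

```python
import h5py
import numpy as np

# Build a small scratch file (hypothetical name) to demonstrate partial reads
with h5py.File("slicing_demo.h5", "w") as fdemo:
    fdemo.create_dataset("times", data=np.linspace(0.0, 1.0, 1001))

with h5py.File("slicing_demo.h5", "r") as fdemo:
    dset = fdemo["times"]

    # Slicing reads only the requested elements from disk
    first_five = dset[:5]

    # Fancy indexing also works; h5py requires the indices in increasing order
    picked = dset[[0, 10, 1000]]

print(first_five)
print(picked)
```

For the stimulus onset times, `f["intervals/trials/stim_on_time"][:5]` would similarly give just the first five values without touching the rest of the dataset.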

We can use Bokeh’s image visualization to see what the stimulus images were. For example, let’s look at image number 125. (For convenience, we will use the imshow() function of the bebi103 package.)

# Pull out stimulus 125
stim_im = f['stimulus/presentation/StimulusPresentation/data'][125]

# View the image
bokeh.io.show(bebi103.image.imshow(stim_im))

Now, to do something interesting with the data, we can look at the number of spikes recorded from all electrodes while the subject views a stimulus, comparing the training phase (before the subject has to answer whether or not he has seen the stimulus before) to recognition of images that are novel (not seen before) or familiar (seen before).

First, we pull out the information about the type of image.

stim_type = f['intervals/trials/new_old_labels_recog'][()]

# Take a look
stim_type
array([b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'0', b'1', b'0', b'1', b'1',
       b'1', b'0', b'1', b'1', b'1', b'1', b'0', b'0', b'1', b'0', b'0',
       b'1', b'0', b'1', b'1', b'1', b'0', b'1', b'0', b'0', b'1', b'0',
       b'1', b'1', b'0', b'0', b'1', b'0', b'0', b'0', b'1', b'0', b'1',
       b'1', b'1', b'1', b'0', b'0', b'1', b'0', b'0', b'1', b'1', b'0',
       b'0', b'1', b'0', b'1', b'0', b'1', b'1', b'0', b'1', b'0', b'0',
       b'1', b'1', b'1', b'0', b'0', b'1', b'0', b'1', b'1', b'1', b'0',
       b'0', b'0', b'1', b'1', b'0', b'0', b'0', b'0', b'1', b'1', b'0',
       b'0', b'0', b'0', b'0', b'1', b'1', b'0', b'1', b'1', b'1', b'0',
       b'0', b'0', b'0', b'1', b'1', b'1', b'0'], dtype=object)

Here, NA means it was a training image, 0 means that the subject had not seen the image before, and 1 means that they had. Note also that these are all byte strings, which we should convert to strings to play nicely with Bokeh. We might as well make them more descriptive in the meantime.

rename = {b'NA': "training", b"1": "familiar", b"0": "novel"}
stim_type = np.array([rename[st] for st in stim_type])

t_on = f['intervals/trials/stim_on_time']
t_off = f['intervals/trials/stim_off_time']

Finally, we pull out the records of the spikes.

spike_times = f['units/spike_times'][()]

Now, we can compute a crude estimate of the spike rate for each stimulus as the number of observed spikes divided by the time the stimulus was displayed.

spike_rate = np.empty_like(t_on)
for i, (t1, t2) in enumerate(zip(t_on, t_off)):
    spike_rate[i] = np.sum((spike_times > t1) & (spike_times < t2)) / (t2 - t1)
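As an aside, this loop can be vectorized: sort the spike times once, then use np.searchsorted() to locate each window boundary. (This sketch uses made-up toy arrays; it counts spikes in half-open windows [t_on, t_off), which differs from the strict inequalities above only at exact boundary times.)

```python
import numpy as np

# Toy stand-ins for spike_times, t_on, and t_off (made-up values)
demo_spikes = np.array([0.5, 1.2, 3.4, 0.9, 2.2, 2.8])  # not necessarily sorted
demo_t_on = np.array([0.0, 2.0])
demo_t_off = np.array([1.0, 3.0])

# Sort once, then count spikes in each half-open window [t_on, t_off)
sorted_spikes = np.sort(demo_spikes)
counts = np.searchsorted(sorted_spikes, demo_t_off) - np.searchsorted(
    sorted_spikes, demo_t_on
)
demo_rate = counts / (demo_t_off - demo_t_on)

print(demo_rate)  # two spikes fall in each one-second window → [2. 2.]
```

For 150 stimuli the explicit loop is perfectly fast, but the searchsorted approach scales much better when the number of windows grows.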

We can toss these data into a data frame for convenient plotting.

df = pl.DataFrame(
    data={
        "stimulus type": stim_type,
        "spike rate (Hz)": spike_rate,
        "stimulus index": np.arange(len(stim_type)),
    }
)

Now we can make a plot!

p = iqplot.ecdf(
    df,
    q='spike rate (Hz)',
    cats='stimulus type',
    tooltips=[('stim ind', '@{stimulus index}')]
) 

bokeh.io.show(p)

In looking at the ECDF, there seems to be a higher spike rate during training than during recognition, and no real difference between presentation of novel versus familiar stimuli. By hovering over the point with the highest spike rate, we see that it is image 78. Let’s take a look!

# Pull out stimulus 78
stim_im = f['stimulus/presentation/StimulusPresentation/data'][78]

# View the image
bokeh.io.show(bebi103.image.imshow(stim_im))

Oooo! A scary shark! It has scared us into remembering to close the file.

f.close()

10.2 PyNWB: A domain-specific tool

While directly using h5py works just fine, as we have seen, the neuroscience community has built its own package for handling NWB files, pynwb. It is already included in the datasai environment, as its dependencies are not onerous.

Instead of using dictionary-like indexing, pynwb enables access to data using functions specific to neuroscience data sets. For example, we can conveniently pull out the spike times of all spikes measured from a single neuron, say neuron with index 10. I deduced the syntax for doing this by reading the pynwb docs.

# Open the file
raw_io = pynwb.NWBHDF5IO(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), 
    "r"
)

# Read the content
nwb_in = raw_io.read()

# Pull out spike times of neuron with index 10
spike_times = nwb_in.units.get_unit_spike_times(10)

# Close the file now that we are done with it
raw_io.close()

For many data types, domain-specific tools exist and can be quite useful. I tend to use them only for parsing, slicing, accessing, and organizing data, but prefer my own bespoke methods for analysis.

10.3 Computing environment

%load_ext watermark
%watermark -v -p numpy,polars,h5py,pynwb,bokeh,bebi103,iqplot,jupyterlab
Python implementation: CPython
Python version       : 3.12.11
IPython version      : 9.1.0

numpy     : 2.1.3
polars    : 1.30.0
h5py      : 3.12.1
pynwb     : 3.0.0
bokeh     : 3.6.2
bebi103   : 0.1.27
iqplot    : 0.3.7
jupyterlab: 4.3.7