Data file formats

Data sets in neuroscience, and in other fields, come in a large variety of formats and are acquired using an even larger variety of instruments. The ability to read and parse data sets in different formats is a key part of data wrangling.

There is no prescribed procedure for handling the various data file formats, so I instead give a few tips here.

  1. If the file format is proprietary, obtain documentation from the instrument manufacturer or whoever produced the file format. If the documentation is not available, call the company to obtain assistance. If no information is available, avoid using the format; don’t guess.
  2. If you are using a proprietary format, try to find a way to convert the data set to an open format. Depending on what field you are working in, there may be automated converters available. For example, you can convert a lot of proprietary microscopy formats to the open OME-TIFF format using Bio-Formats.
  3. For open file formats, carefully read the documentation of the file format.
  4. If you need to read a file format into a data frame, array, or other familiar Python data type, there are often packages available to do so, and these are usually described in the documentation of the file format.
  5. When reading in data, make sure you get it all. Metadata are data that provide information about other data, presumably the “main” content of a data set. (I often do not make this distinction; metadata are data!) Be sure you know how to access the metadata within a given file format.
  6. If the format is niche (that is, not something like CSV, TIFF, HDF5, or netCDF4, for which widely used standard packages are available) and you have to use a format-specific package to read it, it is a good idea to create a new conda environment for reading in the data. Niche format packages often have many dependencies and may not be consistently updated, so you may get dependency clashes when you install the package.
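
As a small illustration of tip 5, here is a minimal sketch of reading a text file whose metadata live in commented lines above the tabular data. The file layout (lines beginning with `#` that hold `key: value` pairs), the field names, and the function name `read_with_metadata` are all hypothetical, invented for this illustration. For a real format, the documentation dictates where the metadata are stored and how to get them.

```python
import csv
import io


def read_with_metadata(f, comment="#"):
    """Read a delimited text file, returning (metadata, rows).

    Assumes a hypothetical layout: lines that start with the comment
    character hold `key: value` metadata, and the remaining non-blank
    lines are CSV with a header row.
    """
    metadata = {}
    data_lines = []
    for line in f:
        if line.startswith(comment):
            # Strip the comment character, then split on the first colon
            key, sep, value = line[len(comment):].partition(":")
            if sep:
                metadata[key.strip()] = value.strip()
        elif line.strip():
            data_lines.append(line)

    # Parse the remaining lines as CSV; values are kept as strings
    rows = list(csv.DictReader(data_lines))
    return metadata, rows


# A made-up file, held in memory so the example is self-contained
example = io.StringIO(
    "# instrument: hypothetical amplifier\n"
    "# sampling rate (Hz): 1000\n"
    "time,voltage\n"
    "0.000,-65.2\n"
    "0.001,-64.9\n"
)

metadata, rows = read_with_metadata(example)
print(metadata)
print(rows[0])
```

The point of returning the metadata alongside the rows is that neither gets silently dropped. A reader that skips comment lines would load the table just fine but lose the sampling rate.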

These tips are general and are not immediately helpful for any given file format. To demonstrate how to handle various file formats, I will give a few examples. Ultimately, though, it is your responsibility as an analyst of data to learn and understand the file format so that you can access a data set completely, accurately, and conveniently. I argue further that as a data producer, you are responsible for storing and sharing your data in an open, easily accessible, well-documented file format.
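
One simple way to act on the data producer's responsibility is to store measurements in a plain CSV file alongside a JSON "sidecar" file that documents them. The sketch below shows only one possible convention, not a standard. The file names, the metadata fields, and the function name `save_with_sidecar` are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_with_sidecar(rows, metadata, csv_path):
    """Write rows (a list of dicts) to CSV and metadata to a JSON sidecar.

    The sidecar shares the CSV file's name, with a .json extension.
    """
    csv_path = Path(csv_path)
    json_path = csv_path.with_suffix(".json")

    # newline="" lets the csv module control line endings
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    with open(json_path, "w") as f:
        json.dump(metadata, f, indent=2)

    return csv_path, json_path


# Made-up data and metadata, written to a temporary directory
rows = [
    {"time_s": 0.0, "voltage_mV": -65.2},
    {"time_s": 0.001, "voltage_mV": -64.9},
]
metadata = {
    "description": "hypothetical recording",
    "columns": {
        "time_s": "time in seconds",
        "voltage_mV": "membrane potential in mV",
    },
}

with tempfile.TemporaryDirectory() as tmp:
    csv_path, json_path = save_with_sidecar(
        rows, metadata, Path(tmp) / "recording.csv"
    )
    csv_text = csv_path.read_text()
    sidecar = json.loads(json_path.read_text())

print(csv_text)
print(sidecar["columns"])
```

Both files can be opened with a text editor or with standard tools in nearly any language, and the sidecar states what each column means and in what units.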

In my experience, for every file format I have encountered, I have needed to spend time reading a combination of documentation and original research papers, and to play around with the file format to get the hang of it. This is just part of the game.

With that, I proceed to the examples.