Digital Scholarship@Leiden

Image: Vecteezy.com / iconvector

Easy ways to improve your research data management: file formats

Research data management can seem daunting and time-consuming. In this series we introduce some relatively straightforward ways of improving your data management, starting with file formats.

Research data management can seem like yet another item on an already too-long to-do list. Here at the Leiden University Libraries, Centre for Digital Scholarship, we think that all research data management is worth doing and that it often saves time in the long term. Some aspects of data management are more straightforward and quicker to implement than others. As a first step, and as the first blog post in this series, we take a look at file formats.

Files are saved in different formats: standard ways of encoding information for digital storage. You can usually see what the file format is by looking at the file extension, such as .docx or .pdf. A programme often has a default file format, like .docx for MS Word, but you normally have multiple options to choose from for the same type of file. For example, you can save a document as .docx or .doc, but also as .pdf, .rtf, or .txt, and you can save an image as .jpg, .tiff, .png, and many more. So, which one should you choose, and does it matter?

Image: Pexels / Cottonbro Studio

To start with the latter: yes, it does matter. Different file formats have different characteristics and different conditions attached to them. In terms of data management, an important distinction is made between proprietary formats, which are owned by a company or organisation and may require specific software, and non-proprietary or open formats, whose specifications are publicly accessible. It is one thing to use a proprietary format during your research, when you know you have access to the relevant software. It is another if you would like your files to remain accessible and usable in the future by everyone, including yourself. Aside from the risk of losing access to paid software, what will happen, for example, if the organisation that owns the file format ceases to exist?

So which file format should you use? As a rule of thumb, use file formats that:

  • Are frequently used (i.e. other researchers in the same discipline use them as well)
  • Have open specifications (i.e. they are non-proprietary)
  • Are independent of specific software, developers, or vendors (i.e. they have long-term stability).1

To find out what this means concretely for different data types, you can check the lists of recommended formats published by repositories, like the DANS preferred formats list or the UK Data Service list of recommended formats. Using these formats gives the best chance that your data will remain usable and accessible in the long term. They also make your data more interoperable and reusable (the ‘I’ and ‘R’ of the FAIR principles: Findable, Accessible, Interoperable, Reusable). In practice, it may not be possible to use the perfect format. The format that is most used in your discipline is often not an open format (for example, formats generated by proprietary software like SPSS or ArcGIS). Converting to another format may make your files less interoperable with other research in your field and may lose some information. In this case, the best option is to keep both the proprietary file and a converted, open file.
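
As a rough illustration of checking a project folder against such a list, the Python sketch below flags files whose extension is not in a set of preferred extensions. The set `PREFERRED_EXTENSIONS` and the function name `flag_non_preferred` are placeholders made up for this example; they are not the DANS or UK Data Service lists, so replace the set with the formats your own repository recommends.

```python
from pathlib import Path

# Placeholder set for illustration only; NOT an official repository list.
# Replace it with the formats your repository actually recommends.
PREFERRED_EXTENSIONS = {".csv", ".txt", ".tiff"}


def flag_non_preferred(folder, preferred=PREFERRED_EXTENSIONS):
    """Return the files under `folder` whose extension is not in `preferred`."""
    flagged = []
    for path in sorted(Path(folder).rglob("*")):
        # Compare in lower case so that .TIFF and .tiff are treated alike.
        if path.is_file() and path.suffix.lower() not in preferred:
            flagged.append(path)
    return flagged
```

Calling `flag_non_preferred("my_project")` returns the paths that may need a converted copy. A file extension is only a hint about the format, so treat the result as a starting point for a manual check rather than as a verdict.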

Something else to keep in mind is that there is not one perfect format for all research stages. The format in which data is generated will often differ from the format produced by data analysis, while yet another format may be better suited to long-term archiving or publication. For example, during laboratory analyses the equipment may produce tabular data in its own machine-specific format, which you could then export as .csv, import into Excel, analyse in R (producing perhaps more .csv or .xlsx files, graphs, R scripts, and more), and finally store as a set of ‘preferred’ file formats (e.g. converting .xlsx to .csv).
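
The export step can be sketched in Python as follows. This is a minimal example that assumes the equipment software can already export a tab-delimited text file; many machine-specific formats are binary and first need the vendor's own export function. The function name `export_to_csv` and its parameters are made up for this illustration.

```python
import csv
from pathlib import Path


def export_to_csv(source, target, delimiter="\t", encoding="utf-8"):
    """Rewrite a delimited text export as a comma-separated UTF-8 file.

    Returns the number of rows written. The source file is left untouched,
    so the original export and the converted copy can be kept side by side.
    """
    source, target = Path(source), Path(target)
    rows = 0
    with source.open(newline="", encoding=encoding) as src, \
            target.open("w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row)
            rows += 1
    return rows
```

For example, `export_to_csv("run01.txt", "run01.csv")` would write a comma-separated copy next to the original export; both file names here are hypothetical.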

It is fine not to use preferred file formats throughout the whole research data cycle. It is, however, important to plan ahead and keep your aims and objectives in mind. For example, if you have an image in a raw format like .dng, you could work with it in the smaller but ‘lossy’ .jpg format, and store it in the larger .tiff format for long-term preservation and reuse. However, if you start off by capturing photos as .jpg, you can no longer obtain a larger, ‘better’ format like .dng or .tiff, because the detail discarded by the .jpg compression cannot be recovered. Whether this matters depends on your short- and long-term, actual and potential research aims. Perhaps your current research focuses simply on recording overall tomb structures and the .jpg format is sufficient. On the other hand, you or someone else may in future want to study small cracks in the structure of the tombs to aid in their conservation. In that case the fine detail of the larger .dng/.tiff image formats is required.

In sum

  • Different research stages are best served by different formats.
  • At least for long-term storage, if you can, use a format that is commonly used in your field, open and non-proprietary, and good for long-term sustainability (e.g. not ‘lossy’). Check it against a repository’s ‘preferred formats’ list.
  • If the file format that is preferred in your field for use during data collection or analysis is a proprietary format, aim to archive and publish both the original file and a copy converted to a non-proprietary, accessible format.
  • Plan ahead and keep aims and objectives in mind.

Footnotes

1 DANS (2023). File formats. Version 1.1. https://dans.knaw.nl/en/file-formats/.

More information

Learn more about file formats in our online event 'Connect & Preserve - File formats: for what? Taking data files into the future' with speaker Valentijn Gilissen of DANS, taking place on 11 July 2024.
