Digital Scholarship@Leiden

Metadata 4 machines help you find and (re)use relevant research data

Author: Kristina Hettne

On Oct 15-16, 2018, metadata researchers and specialists from around the world came together for the inaugural metadata for machines (M4M) workshop in Leiden. The goal of the workshop was to achieve convergence on metadata standards and tooling among key stakeholders interested in Findable, Accessible, Interoperable and Reusable (FAIR) data and services.

If you are not very data-savvy, the first question that pops into your mind might be "What is metadata?". Simply put, it is data about data. Consider a book. In this example, the book is the data, and the information about the author of the book may be considered a piece of metadata. About half a century ago, we would store this metadata in a card catalogue. Nowadays, however, we have digital catalogues with search engines on top of them (the "machines" in metadata 4 machines). The book example shows how important it is for metadata to be correct, and also rich. Knowing the title of a book is not enough to be sure that you have found exactly the book you were looking for. If you also know the author, you can be fairly sure that it is the correct one. Add publisher and date information and you can narrow down the search even more. Of course, knowing the International Standard Book Number (ISBN) gives you a unique identifier and therefore certainty that it is the book you are looking for.
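To make the book example concrete, here is a minimal sketch in Python of how each extra piece of metadata narrows a catalogue search. Everything in it is an assumption made for illustration: the `BookRecord` fields, the `search` helper, and the catalogue entries (including the placeholder identifiers, which are not real ISBNs) are invented and do not describe any real catalogue system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookRecord:
    """Hypothetical metadata record for one book (the book itself is the data)."""
    title: str
    author: str
    publisher: str
    year: int
    isbn: str  # placeholder identifiers below, not real ISBNs


# Invented catalogue entries, for illustration only.
CATALOGUE = [
    BookRecord("Introduction to Data", "A. Jansen", "North Press", 1998, "isbn-0001"),
    BookRecord("Introduction to Data", "A. Jansen", "North Press", 2005, "isbn-0002"),
    BookRecord("Introduction to Data", "B. de Vries", "South Press", 2005, "isbn-0003"),
    BookRecord("Advanced Data", "A. Jansen", "North Press", 2010, "isbn-0004"),
]


def search(catalogue, **criteria):
    """Return the records whose metadata matches every given field exactly."""
    return [
        record
        for record in catalogue
        if all(getattr(record, field) == value for field, value in criteria.items())
    ]


# Each additional metadata field narrows the result set.
print(len(search(CATALOGUE, title="Introduction to Data")))                    # 3
print(len(search(CATALOGUE, title="Introduction to Data", author="A. Jansen")))  # 2
print(len(search(CATALOGUE, title="Introduction to Data", author="A. Jansen", year=2005)))  # 1
print(len(search(CATALOGUE, isbn="isbn-0003")))                                # 1
```

The title alone matches three records, adding the author leaves two, and adding the year leaves one. The unique identifier finds a single record on its own, which is the role the ISBN plays for books.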

This type of precision is also crucial for research data, to facilitate search and to be sure that what you find is exactly what you are looking for. Therefore, when you deposit your data in a repository following the publication of your paper, you are usually also asked to provide metadata statements, such as the list of authors of the original publication, and to give your dataset a name. Other metadata will be provided by the repository, such as a Digital Object Identifier (DOI), which is a unique reference to your data. But how do we decide what “enough” metadata is? The FAIR principles state that metadata should be “rich”, but do not define what “rich” means. The answer to the question of how much metadata to attach probably depends on the purpose for which the data are put in a repository. Is it only for the data to be found and accessed? For this purpose a name and a unique identifier might be enough. Is the purpose reproducibility? That is, to make sure that the scientific experiment performed can be done again with the “recipe” (metadata) and “ingredients” (data) provided? Then we need to know more than just a name and an identifier. Is the purpose reuse? If so, we cannot know in advance what someone else will do with the data, and therefore need to decide on the minimum amount of metadata that makes the data rich enough for someone to use them with another purpose in mind. Ideally, the data themselves should also be modeled in an interoperable way. The discussions in the M4M workshop centered around these topics, for example how to define how much metadata is needed, what the metadata should look like, and what tooling is available to model and deposit data and metadata.

The first day of the workshop was dedicated to knowledge exchange, showcasing exciting metadata initiatives such as those developed by CLARIN (a European Research Infrastructure for all disciplines, but especially for the humanities and social sciences), FAIR data point (a tool to expose datasets in compliance with the FAIR principles), CEDAR (a metadata repository for biomedicine), and FAIRsharing (a catalogue of data and metadata standards for all disciplines, inter-related to databases and data policies). These initiatives represent different flavours of infrastructure. CLARIN allows you to store and search for data and metadata, FAIR data point offers a lightweight way to expose datasets, CEDAR provides tools to produce optimal metadata, and FAIRsharing facilitates the search for a fitting service to deposit your data and metadata. At the Center for Digital Scholarship in Leiden we host a research data service catalogue with a goal similar to that of FAIRsharing, and discussions about possible collaboration took place.

The second day started with a gap analysis and a first discussion on how to bridge these gaps. These discussions were used as input for the afternoon, where we tackled specific challenges in more depth, such as how metadata should be described when writing a data management plan for a research project. This work resulted in a pilot project together with ZonMW that will be continued in follow-up workshops.

The M4M workshop was organized by GoFAIR Leiden together with the Research Data Alliance, as one of the activities to kick-start the European Open Science Cloud (EOSC).
