How can we monitor where Leiden researchers deposit their data? A comparative analysis between open sources
Datasets are increasingly deposited and published by Leiden University researchers. But where do they do this and how can we find out? During her internship at the Leiden University Libraries’ Centre for Digital Scholarship and the CWTS, Korina Lemonidou researched this question.
In recent years, the number of datasets deposited and published by Leiden University researchers has increased. As a result, the monitoring of these datasets has gained traction, and monitoring tools are becoming increasingly important in a university context. Elsevier’s initiative to develop a Data Monitor was therefore by no means unexpected; the tool "draws on millions of research data records stored in 2,000+ repositories to give [the] institution visibility of [its] entire research data output" (Elsevier n.d.). This development has been the starting point for various research initiatives, including the ‘Elsevier Data Monitor Pilot’ project, undertaken by the Centre for Digital Scholarship (CDS) of the Leiden University Libraries (Hettne et al. 2023). That project concluded that, while the Elsevier tool did have potential, its usability for Leiden University was limited at the time. The present research, a joint initiative between the CDS and the Centre for Science and Technology Studies (CWTS) Open Science Lab, builds on this earlier work and follows up on the ‘suggested step’ of the Hettne et al. report to perform a comparative analysis between the Data Monitor and an open, non-commercial party. In this blog, I present the first results of the part of the study comparing three different open sources.
The open data sources selected for this project were Crossref, DataCite, and OpenAlex, accessed through the CWTS database. I compared these open sources to each other (and to the Elsevier Data Monitor) in terms of the metadata availability and quality of datasets provided by Leiden researchers, while also evaluating the opportunities and challenges inherent in each case. For this comparison, I decided to use specific evaluation criteria in accordance with the data management regulations of Leiden University (2021): the title, date of publication, author information, name identifier (primarily in terms of ORCID), affiliation name, affiliation faculty, repository name, repository type, data identifiers, versioning information, keywords and subject areas, research funding information, licensing and usage rights and access and distribution information of the datasets.
Crossref
The first source to be queried, and the most extensive of the three with 33,000 entries for Leiden University, was Crossref. Crossref retrieves its data mainly from its publisher members and makes it openly available via Application Programming Interfaces (APIs), operating on a DOI-centric framework. It quickly became apparent that ‘dataset’ was not recognised as a valid publication type in Crossref, at least when accessed through the CWTS database,1 which led to questions about the relevance of this open source for the research. The 33,000 Leiden entries relate to other outputs, such as journal articles or conference papers. The absence of datasets could be explained by a lack of datasets deposited by publisher members, issues concerning the storage of these datasets, limited awareness of the importance of monitoring datasets, and/or another factor.
DataCite
The second open data source used was DataCite. In the data available up to January 2024, a total of 14,255 datasets were identified as registered by Leiden University scholars. DataCite predominantly operates in a DOI-centric framework, emphasising the allocation of Digital Object Identifiers (DOIs) to non-traditional research outputs, with a particular focus on datasets. The following bar chart shows the metadata quality of the Leiden-related datasets available on DataCite, with a focus on availability and completeness.


Overall, the evaluation criteria were consistent with the metadata fields available in DataCite. The main challenges in the records registered by Leiden researchers related to consistency and systemisation in the fields that required manual input from the authors, showing a need for more systematic organisation, particularly for the fields labelled ‘title’, ‘creator’, ‘creator affiliation’ and ‘subject’. This lack of systemisation ultimately hinders findability, as in the case of Dutch names containing a suffix, which are not categorised properly and are therefore more difficult to locate. Additionally, only 12% of Leiden-affiliated datasets included an ORCID identifier, while the incomplete ‘license’ field raised questions about the quality of the data. These findings suggest that, although improvements in the completeness and systemisation of the metadata are necessary, DataCite has the potential to offer metadata of sufficient quality and completeness in the future.
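To illustrate why inconsistently entered names are hard to locate, here is a minimal Python sketch of one possible way to bring differently ordered name strings onto a common matching key. It is not the method used in the study. The particle list, the function name `name_key` and the example names are all assumptions made for illustration, and the sketch assumes that given names are recorded as initials.

```python
import re

# Assumption: a small, non-exhaustive list of Dutch name particles.
DUTCH_PARTICLES = {"van", "de", "der", "den", "het", "ten", "ter", "te"}


def name_key(raw_name):
    """Build an order-independent key: (main surname, particles, initials).

    Assumes given names appear as single-letter initials; full first names
    would be treated as part of the surname.
    """
    # Lower-case the name and keep runs of letters only (drops commas and dots).
    tokens = re.findall(r"[^\W\d_]+", raw_name.lower())
    particles = [t for t in tokens if t in DUTCH_PARTICLES]
    initials = [t for t in tokens if len(t) == 1]
    surname = [t for t in tokens if t not in DUTCH_PARTICLES and len(t) > 1]
    return (" ".join(surname), " ".join(particles), "".join(initials))


# Hypothetical creator strings, as they might be typed by different depositors.
names = ["van der Berg, B.", "Berg, B. van der", "B. van der Berg"]
keys = {name_key(n) for n in names}
print(keys)
```

In this invented example the three spellings collapse to a single key, so a search on that key would find all three records. Real creator fields are messier, and a persistent identifier such as ORCID remains the more reliable way to match authors.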
OpenAlex
OpenAlex was the last open data source to be queried, containing a total of 1,538 distinct datasets registered by Leiden University researchers as of August 2024 (2,308 when the LUMC is also included). This open source strives to keep track of various forms of academic output, including publications, datasets and dissertations, while also identifying the underlying connections among them by ‘finding associations through things like journals, authors, institutional affiliation, citations, concepts and funders’ (OpenAlex n.d.). To do so, OpenAlex tracks metadata from both Crossref and DataCite, the other two sources in this comparison. The following bar chart shows the metadata quality of the Leiden-related datasets available on OpenAlex, with a focus on availability and completeness.


Overall, the evaluation criteria matched the metadata fields available in the OpenAlex database. The issues observed in DataCite were less visible in OpenAlex, as fields such as ‘title’, ‘author name’, ‘institution’ and ‘keyword’ were completely consistent. The high availability of ORCID identifiers (94% for datasets and 85% for all Leiden-related deposited data) and the availability of Scopus identifiers, although low (20%), suggested an overall higher quality and completeness of the metadata in this open source.
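Availability percentages of the kind reported for DataCite and OpenAlex can be understood as the share of records in which a given field is filled in. The following Python sketch shows that calculation under stated assumptions: the records, field names and values are invented for illustration and do not reflect the actual CWTS database schema or the study's data.

```python
from collections import Counter

# Hypothetical dataset records; field names are assumptions, not a real export.
records = [
    {"title": "Excavation measurements", "creator": "Jansen, A.",
     "orcid": "0000-0000-0000-0001", "license": "CC-BY-4.0"},
    {"title": "Survey responses", "creator": "van der Berg, B.",
     "orcid": None, "license": ""},
    {"title": "  ", "creator": "Berg, B. van der",
     "orcid": None, "license": "CC0-1.0"},
    {"title": "Sensor readings", "creator": "De Vries, C.",
     "license": None},
]


def completeness(records, fields):
    """Return, per field, the share of records with a non-empty value."""
    filled = Counter()
    for record in records:
        for field in fields:
            value = record.get(field)
            # Missing keys, None and whitespace-only strings count as empty.
            if value is not None and str(value).strip():
                filled[field] += 1
    total = len(records)
    return {f: (filled[f] / total if total else 0.0) for f in fields}


shares = completeness(records, ["title", "creator", "orcid", "license"])
for field, share in shares.items():
    print(f"{field}: {share:.0%}")
```

A share computed this way only measures whether a field is filled in, not whether its content is consistent or correct, which is why availability and consistency are discussed separately above.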
Finally, the overlap between the datasets found in DataCite and OpenAlex was studied by comparing their DOIs. The number of datasets found in both open sources came to 1,388.
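A DOI comparison of this kind can be sketched in a few lines of Python. The sketch below is an illustration, not the query used in the study: the DOIs are made up and the function names are hypothetical. Because DOIs are case-insensitive and are often stored with a resolver prefix, the sketch normalises them before comparing.

```python
# Common ways a DOI may be prefixed in metadata records.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(doi):
    """Lower-case a DOI and strip whitespace and a leading resolver prefix."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def doi_overlap(source_a, source_b):
    """Return (shared, only_in_a, only_in_b) as sets of normalised DOIs."""
    a = {normalise_doi(d) for d in source_a if d}
    b = {normalise_doi(d) for d in source_b if d}
    return a & b, a - b, b - a


# Hypothetical DOI lists standing in for the two sources.
datacite_dois = ["10.1234/ABC.1", "https://doi.org/10.1234/abc.2", "10.1234/abc.3"]
openalex_dois = ["https://doi.org/10.1234/abc.1", "doi:10.1234/ABC.2", "10.1234/xyz.9"]

shared, only_datacite, only_openalex = doi_overlap(datacite_dois, openalex_dois)
print(len(shared), len(only_datacite), len(only_openalex))
```

Without the normalisation step, the same dataset written as `10.1234/ABC.1` in one source and `https://doi.org/10.1234/abc.1` in the other would be counted as two different records, understating the overlap.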

Discussion and conclusions
While reflecting on the findings derived from the three open sources, it appears that the current coverage of datasets is relatively limited. DataCite does emerge as a promising foundational resource for advancing quality of metadata for datasets, yet it lacks consistency and systemisation — attributes that are better achieved by OpenAlex. Nonetheless, this also implies that the monitoring of datasets by open sources is feasible and that there is fertile ground for the advancement of dataset monitoring practices in terms of metadata quality and completeness. The overlap between the open sources also suggests that there is a relationship between the sources that could be further harvested in order to achieve better metadata quality in the future.
Keeping all of this in mind, several conclusions can be drawn from this study. Firstly, not all open data aggregators store datasets, as exemplified by Crossref, which highlights a still limited interest in (or perhaps awareness of) this specific resource type. Consequently, despite the rising popularity of dataset monitoring, it is clear that not all open sources are yet engaged in (or even aware of) the significance of this practice for open science in an institutional context. Furthermore, in open sources, and especially DataCite, the metadata fields of datasets lack systematic organisation and consistency, which could potentially be attributed to a general lack of instructions for researchers. This lack of consistency ultimately impedes both accessibility and metadata quality. The overlap observed between the open sources also suggests that there is a relationship between the sources that could be further exploited to achieve higher metadata quality in the future. Finally, the main conclusion of this research is that open sources such as DataCite and OpenAlex could represent a more favourable and viable alternative for monitoring datasets in an institutional context than proprietary sources such as the Data Monitor.
Notes
1. The publication types available within the Crossref database were ‘book content’, ‘conference paper’, ‘journal article’, ‘peer review’, ‘posted content’ and ‘report-paper content’.
References and further reading
- A comprehensive report has been written by the author, which is available on request and is planned to be converted to an open access article.
- Crossref, ‘Metadata Retrieval’, https://www.crossref.org/pdfs/about-metadata-retrieval.pdf (accessed 7 November 2024).
- DataCite, ‘About DataCite’, https://datacite-metadata-schema.readthedocs.io/en/4.6/introduction/about-datacite/ (accessed 11 November 2024).
- Elsevier n.d., 'Data Monitor: Track and Analyze your Institutional Research Data Records', https://www.elsevier.com/produ... (accessed 4 November 2024; link no longer active at date of blog publication).
- Hettne, Kristina, Saskia Woutersen, Laurents Sesink, Fieke Schoots and Ben Companjen, ‘Elsevier Data Monitor Phase A Report’ (2023). Available on request (cds@library.leidenuniv.nl).
- Leiden University 2021, 'Leiden University Data Management Regulations'.
- OpenAlex n.d., 'About Us', https://help.openalex.org/hc/en-us/articles/24396686889751-About-us (accessed 12 November 2024).
This article has been reviewed and copy-edited by Pascal Flohr and Femmy Admiraal.
Banner photo by Kevin Ku: https://www.pexels.com/photo/d...