Research Data in Scientific Publications: A Cross-Field Analysis
Puyu Yang, Giovanni Colavizza
TL;DR
This study examines how research data are released, reused, and referenced across disciplines by applying full-text NLP to the PubMed Open Access corpus and training a RoBERTa-based classifier to detect the intent behind dataset mentions. It finds that data release is the dominant sharing mode across many fields, while data reuse is more prevalent in STEM disciplines and dataset referencing remains relatively rare, suggesting datasets are not yet fully recognized as research outputs. Temporal analysis indicates a post-2012 acceleration in data release, alongside persistent challenges in data discoverability and reuse compatibility. The work offers insights that institutions and publishers can use to improve data accessibility and foster broader open science adoption across fields.
Abstract
Data sharing is fundamental to scientific progress, enhancing transparency, reproducibility, and innovation across disciplines. Despite its growing significance, the variability of data-sharing practices across research fields remains insufficiently understood, limiting the development of effective policies and infrastructure. This study investigates the evolving landscape of data-sharing practices, specifically focusing on the intentions behind data release, reuse, and referencing. Leveraging the PubMed open dataset, we developed a model to identify mentions of datasets in the full text of publications. Our analysis reveals that data release is the most prevalent sharing mode, particularly in fields such as Commerce, Management, and the Creative Arts. In contrast, STEM fields, especially the Biological and Agricultural Sciences, show significantly higher rates of data reuse. The humanities and social sciences, however, are slower to adopt these practices. Notably, dataset referencing remains low across most disciplines, suggesting that datasets are not yet fully recognized as research outputs. A temporal analysis highlights an acceleration in data releases after 2012, yet obstacles such as data discoverability and compatibility for reuse persist. Our findings can inform institutional and policy-level efforts to improve data-sharing practices, enhance dataset accessibility, and promote broader adoption of open science principles across research domains.
