On the Readiness of Scientific Data for a Fair and Transparent Use in Machine Learning
Joan Giner-Miguelez, Abel Gómez, Jordi Cabot
TL;DR
This study probes the readiness of scientific data documentation for fair and transparent use in ML by auditing 4041 data papers from two journals against ML-driven documentation expectations and comparing them with dataset papers from the NeurIPS D&B track. Using an LLM-based extraction pipeline, it shows that while core descriptive and usage dimensions are widely documented, critical aspects like generalization limits, social concerns, and maintenance policies are inconsistently reported, especially outside ML-focused venues. The authors provide actionable guidelines to improve submission rules, introduce provenance and maintenance reporting, and advocate machine-readable metadata schemas to enhance discoverability and ML applicability. Overall, the work highlights substantial gaps between current scientific data practices and ML readiness, and offers a concrete path to align data publishing with responsible ML deployment.
Abstract
To ensure the fairness and trustworthiness of machine learning (ML) systems, recent legislative initiatives and relevant research in the ML community have pointed out the need to document the data used to train ML models. In parallel, data-sharing practices in many scientific domains have evolved in recent years for reproducibility purposes. As academic institutions have adopted these practices, researchers have been encouraged to publish their data and technical documentation in peer-reviewed publications such as data papers. In this study, we analyze how well this broader scientific data documentation meets the needs of the ML community and regulatory bodies for its use in ML technologies. We examine a sample of 4041 data papers from different domains, assessing their completeness, their coverage of the requested dimensions, and trends in recent years. We focus on the most and least documented dimensions and compare the results with those of an ML-focused venue (the NeurIPS D&B track) that publishes papers describing datasets. Based on these findings, we propose a set of guidelines for data creators and scientific data publishers to increase their data's preparedness for transparent and fairer use in ML technologies.
