On the Readiness of Scientific Data for a Fair and Transparent Use in Machine Learning

Joan Giner-Miguelez, Abel Gómez, Jordi Cabot

TL;DR

This study probes the readiness of scientific data documentation for fair and transparent use in ML by auditing 4041 data papers from two journals against ML-driven documentation expectations and a NeurIPS D&B benchmark. Using an LLM-based extraction pipeline, it shows that while core descriptive and usage dimensions are widely documented, critical aspects like generalization limits, social concerns, and maintenance policies are inconsistently reported, especially outside ML-focused venues. The authors provide actionable guidelines to improve submission rules, introduce provenance and maintenance reporting, and advocate machine-readable metadata schemas to enhance discoverability and ML applicability. Overall, the work highlights substantial gaps between current scientific data practices and ML readiness, and offers a concrete path to align data publishing with responsible ML deployment.

Abstract

To ensure the fairness and trustworthiness of machine learning (ML) systems, recent legislative initiatives and relevant research in the ML community have pointed out the need to document the data used to train ML models. Besides, data-sharing practices in many scientific domains have evolved in recent years for reproducibility purposes. In this sense, academic institutions' adoption of these practices has encouraged researchers to publish their data and technical documentation in peer-reviewed publications such as data papers. In this study, we analyze how this broader scientific data documentation meets the needs of the ML community and regulatory bodies for its use in ML technologies. We examine a sample of 4041 data papers of different domains, assessing their completeness, coverage of the requested dimensions, and trends in recent years. We focus on the most and least documented dimensions and compare the results with those of an ML-focused venue (NeurIPS D&B track) publishing papers describing datasets. As a result, we propose a set of recommendation guidelines for data creators and scientific data publishers to increase their data's preparedness for its transparent and fairer use in ML technologies.

Paper Structure

This paper contains 13 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Number of data papers in the evaluated sample by publication year, 2015 to 2023. Papers from 2023 are included only through June.
  • Figure 2: Diversity of collection and annotation team types in the analyzed data papers
  • Figure 3: Collection: Diversity of types of collection processes
  • Figure 4: Annotation: Diversity of types of annotation processes
  • Figure 5: Overall results of informed dimensions. The Social concerns and Profile of the collection targets dimensions have been evaluated only on datasets gathered from or describing people (16.5% of the sample). Speech context in language datasets has been assessed only on datasets representing natural language (5.15% of the sample). Annotation dimensions have been assessed only on datasets created through an annotation process (42.28% of the sample). In these cases, the percentage reflects the occurrence of those dimensions relative to the number of papers that should declare them.
  • ...and 6 more figures