The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track

Eshta Bhardwaj, Harshit Gujral, Siyi Wu, Ciara Zogheib, Tegan Maharaj, Christoph Becker

TL;DR

The paper introduces a data curation–inspired evaluation framework (rubric + toolkit) to audit dataset documentation within NeurIPS's Datasets and Benchmarks track and applies it to 60 datasets published from 2021 to 2023. It finds wide variability in documentation quality, with critical gaps in environmental footprint, ethics, and reflexivity, and only partial progress over time despite stricter submission guidelines. The authors demonstrate high inter‑rater reliability after iterative refinement and offer concrete strategies plus a peer‑review proposal to institutionalize rigorous data curation in ML. By providing a structured auditing approach and a dataset of curation results, the work aims to strengthen reusability, reproducibility, and responsible oversight in ML data practices.

Abstract

Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not millennia. The field of machine learning is increasingly observing the importance of data curation to the advancement of both applications and fundamental understanding of machine learning models - evidenced not least by the creation of the Datasets and Benchmarks track itself. This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit developed through a literature review of data curation principles. We use the framework to assess the strengths and weaknesses in current dataset development practices of 60 datasets published in the NeurIPS Datasets and Benchmarks track from 2021-2023. We summarize key findings and trends. Results indicate greater need for documentation about environmental footprint, ethical considerations, and data management. We suggest targeted strategies and resources to improve documentation in these areas and provide recommendations for the NeurIPS peer-review process that prioritize rigorous data curation in ML. Finally, we provide results in the format of a dataset that showcases aspects of recommended data curation practices. Our rubric and results are of interest for improving data curation practices broadly in the field of ML as well as to data curation and science and technology studies scholars studying practices in ML. Our aim is to support continued improvement in interdisciplinary research on dataset practices, ultimately improving the reusability and reproducibility of new datasets and benchmarks, enabling standardized and informed human oversight, and strengthening the foundation of rigorous and responsible ML research.

Paper Structure

This paper contains 11 sections, 3 figures.

Figures (3)

  • Figure 1: Inter-rater reliability (IRR) (a) across evaluation rounds and (b) within round 5 across rubric categories. The improvement in IRR across rounds and the ultimately high IRR across categories provide evidence that the multi-stage quality and consistency process described in the Methods section was successful. In addition to this quantitative measure, we conducted qualitative participatory evaluations with reviewers in each round; see R1 and Appendix.
  • Figure 2: Percentage of completed documentation per dataset (a, b) and per element (c, d) in round 5 (i.e., after a multi-step iterative process to improve quality). In (a) we observe that the highest-scoring dataset fulfilled 86% of the criteria for the minimum standard of quality, while the lowest fulfilled only 39%. In (b), for the standard of excellence, we see a similar spread (approximately 50% difference) but lower attainment (the highest fulfilled 50% of the criteria; the lowest two fulfilled none of the criteria for excellence); see R2. For both (c) the minimum standard and (d) the standard of excellence, we observe that elements more closely related to model work (such as 'suitability' and 'reliability') are more consistently fulfilled; see R3.
  • Figure 3: Temporal distribution across years 2021-2023, (a) 'pass scores' for the minimum standard of quality and (b) 'full scores' for the standard of excellence across elements. In both cases there is no change across time; see R6.
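
The quantities in the captions above (share of rubric criteria fulfilled per dataset and per element, and agreement between raters) can be illustrated with a small sketch. This is not the authors' toolkit: the dataset names, the pass/fail values, and the choice of Cohen's kappa as the agreement statistic are all assumptions made for illustration, since the captions do not state which IRR measure was used.

```python
from collections import Counter


def completion_rate(flags):
    """Fraction of rubric criteria marked as fulfilled (True)."""
    return sum(flags) / len(flags)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: the source does not name its IRR statistic; kappa is
    used here only as one common choice.
    """
    n = len(rater_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical rubric results: one row per dataset, one column per element.
rubric = {
    "dataset_A": [True, True, False, True],
    "dataset_B": [False, True, False, False],
}

# Per-dataset completion (cf. Figure 2a, b).
per_dataset = {name: completion_rate(row) for name, row in rubric.items()}

# Per-element completion (cf. Figure 2c, d): transpose rows into columns.
per_element = [completion_rate(col) for col in zip(*rubric.values())]

# Hypothetical labels from two reviewers scoring the same four criteria.
reviewer_1 = ["pass", "pass", "fail", "pass"]
reviewer_2 = ["pass", "fail", "fail", "pass"]
kappa = cohen_kappa(reviewer_1, reviewer_2)

print(per_dataset)
print(per_element)
print(kappa)
```

Keeping the rubric as a dataset-by-element grid means the same data yields both views in Figure 2: row-wise rates give the per-dataset percentages, and column-wise rates give the per-element percentages.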