Machine Learning Data Practices through a Data Curation Lens: An Evaluation Framework

Eshta Bhardwaj, Harshit Gujral, Siyi Wu, Ciara Zogheib, Tegan Maharaj, Christoph Becker

TL;DR

This paper tackles the problem of opaque and potentially biased ML data practices by introducing a data curation lens and an evaluation framework. It develops a rubric with 19 dimensions across five groups, plus a supporting toolkit, to assess ML datasets through principles drawn from digital and data curation, archival ethics, and the FAIR data model. Applied to 25 NeurIPS datasets through four evaluation rounds, the study reveals challenges such as false friends in terminology, interpretative flexibility, and varying depth of analysis, while also showing improvements in inter-rater reliability as the toolkit evolves. The work contributes a concrete, interdisciplinary method to assess and improve dataset documentation and stewardship, enabling more transparent, fair, and accountable ML data practices and laying groundwork for broader adoption across ML research and review processes.

Abstract

Studies of dataset development in machine learning call for greater attention to the data practices that make model development possible and shape its outcomes. Many argue that the adoption of theory and practices from archives and data curation fields can support greater fairness, accountability, transparency, and more ethical machine learning. In response, this paper examines data practices in machine learning dataset development through the lens of data curation. We evaluate data practices in machine learning as data curation practices. To do so, we develop a framework for evaluating machine learning datasets using data curation concepts and principles through a rubric. Through a mixed-methods analysis of evaluation results for 25 ML datasets, we study the feasibility of adopting data curation principles for machine learning data work in practice and explore how data curation is currently performed. We find that researchers in machine learning, a field that often emphasizes model development, struggle to apply standard data curation principles. Our findings illustrate difficulties at the intersection of these fields, such as evaluating dimensions whose terms are shared by both fields but carry different meanings, a high degree of interpretative flexibility in adapting concepts without prescriptive restrictions, obstacles in limiting the depth of data curation expertise needed to apply the rubric, and challenges in scoping the extent of documentation dataset creators are responsible for. We propose ways to address these challenges and develop an overall framework for evaluation that outlines how data curation concepts and methods can inform machine learning data practices.

Paper Structure (21 sections, 2 figures, 1 table)

Figures (2)

  • Figure 1: Multi-stage development and evaluation process of the rubric and toolkit
  • Figure 2: Inter-rater reliability (IRR) across datasets and rounds