Completeness of Datasets Documentation on ML/AI repositories: an Empirical Investigation

Marco Rondina, Antonio Vetrò, Juan Carlos De Martin

TL;DR

The paper investigates the completeness of dataset documentation across leading ML/AI repositories using a structured Documentation Test Sheet (DTS) derived from Datasheets for Datasets. It applies the DTS to 100 datasets from Hugging Face, Kaggle, OpenML, and UC Irvine, and finds that the documentation of most datasets is less than 50% complete: usage information is the most commonly present, while details on data collection and processing are often missing. The study highlights substantial transparency gaps and shows how repository metadata schemes influence documentation completeness. It argues for broader adoption of standardized, machine-readable metadata and DTS-driven checks by dataset hosts and creators to enhance transparency, reproducibility, and responsible ML/AI practices.

Abstract

ML/AI is the field of computer science and computer engineering that arguably received the most attention and funding over the last decade. Data is the key element of ML/AI, so it is becoming increasingly important to ensure that users are fully aware of the quality of the datasets that they use, and of the process generating them, so that possible negative impacts on downstream effects can be tracked, analysed, and, where possible, mitigated. One of the tools that can be useful in this perspective is dataset documentation. The aim of this work is to investigate the state of dataset documentation practices, measuring the completeness of the documentation of several popular datasets in ML/AI repositories. We created a dataset documentation schema -- the Documentation Test Sheet (DTS) -- that identifies the information that should always be attached to a dataset (to ensure proper dataset choice and informed use), according to relevant studies in the literature. We verified 100 popular datasets from four different repositories with the DTS to investigate which information was present. Overall, we observed a lack of relevant documentation, especially about the context of data collection and data processing, highlighting a paucity of transparency.
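
The completeness measurement described in the abstract can be illustrated with a minimal sketch. It assumes a DTS is a fixed list of fields grouped by section, and that completeness is the fraction of fields found in a dataset's documentation. The section and field names below are hypothetical placeholders, not the paper's actual DTS, and the simple averaging is an assumption about how presence might be scored.

```python
from typing import Dict, List

# Hypothetical DTS layout: section -> fields expected in the documentation.
# These names are illustrative only; the paper's real DTS is not reproduced here.
DTS_SECTIONS: Dict[str, List[str]] = {
    "motivation": ["purpose", "creators"],
    "collection": ["collection_process", "timeframe"],
    "processing": ["preprocessing", "raw_data_available"],
    "uses": ["intended_tasks", "license"],
}


def section_presence(doc: Dict[str, bool]) -> Dict[str, float]:
    """Fraction of fields present in each section (missing keys count as absent)."""
    return {
        section: sum(bool(doc.get(field, False)) for field in fields) / len(fields)
        for section, fields in DTS_SECTIONS.items()
    }


def completeness(doc: Dict[str, bool]) -> float:
    """Fraction of all DTS fields present in one dataset's documentation."""
    all_fields = [field for fields in DTS_SECTIONS.values() for field in fields]
    return sum(bool(doc.get(field, False)) for field in all_fields) / len(all_fields)


# A dataset whose documentation covers its purpose and usage only.
example = {"purpose": True, "intended_tasks": True, "license": True}
print(completeness(example))      # 3 of 8 fields present
print(section_presence(example))  # per-section fractions
```

In this toy example the usage section is fully documented while collection and processing are empty, which mirrors the kind of pattern the abstract reports.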

Paper Structure

This paper contains 14 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Distribution of Dataset Presence Averages grouped by repository.
  • Figure 2: Section Presence Averages.