Table of Contents
Fetching ...

The State of Documentation Practices of Third-party Machine Learning Models and Datasets

Ernesto Lang Oreamuno, Rohan Faiyaz Khan, Abdul Ali Bangash, Catherine Stinson, Bram Adams

TL;DR

The paper examines the state of documentation for third-party ML resources in Hugging Face stores, using scraping, hybrid card sorting, and deductive coding to assess Model Card and Dataset Card coverage against established standards. It finds substantial gaps: only about 39.62% of models and 28.48% of datasets are documented, with extensive omissions in ethics, caveats, and key metadata, and frequent misalignment between content and card sections. The results underscore a need for improved documentation practices and tooling to support responsible reuse and deployment of third-party ML artifacts, with implications for researchers, practitioners, and tool developers. The work provides a public dataset and actionable directions to strengthen documentation standards and enforceable practices in ML model stores.

Abstract

Model stores offer third-party ML models and datasets for easy project integration, minimizing coding efforts. One might hope to find detailed specifications of these models and datasets in the documentation, leveraging documentation standards such as model and dataset cards. In this study, we use statistical analysis and hybrid card sorting to assess the state of the practice of documenting model cards and dataset cards in one of the largest model stores in use today--Hugging Face (HF). Our findings show that only 21,902 models (39.62\%) and 1,925 datasets (28.48\%) have documentation. Furthermore, we observe inconsistency in ethics and transparency-related documentation for ML models and datasets.

The State of Documentation Practices of Third-party Machine Learning Models and Datasets

TL;DR

The paper examines the state of documentation for third-party ML resources in Hugging Face stores, using scraping, hybrid card sorting, and deductive coding to assess Model Card and Dataset Card coverage against established standards. It finds substantial gaps: only about 39.62% of models and 28.48% of datasets are documented, with extensive omissions in ethics, caveats, and key metadata, and frequent misalignment between content and card sections. The results underscore a need for improved documentation practices and tooling to support responsible reuse and deployment of third-party ML artifacts, with implications for researchers, practitioners, and tool developers. The work provides a public dataset and actionable directions to strengthen documentation standards and enforceable practices in ML model stores.

Abstract

Model stores offer third-party ML models and datasets for easy project integration, minimizing coding efforts. One might hope to find detailed specifications of these models and datasets in the documentation, leveraging documentation standards such as model and dataset cards. In this study, we use statistical analysis and hybrid card sorting to assess the state of the practice of documenting model cards and dataset cards in one of the largest model stores in use today--Hugging Face (HF). Our findings show that only 21,902 models (39.62\%) and 1,925 datasets (28.48\%) have documentation. Furthermore, we observe inconsistency in ethics and transparency-related documentation for ML models and datasets.
Paper Structure (11 sections, 4 figures)

This paper contains 11 sections, 4 figures.

Figures (4)

  • Figure 1: Model card section title prevalence across the 378 sampled model cards. A total of 2,053 sections were analyzed, out of which 19 remain un-categorized.
  • Figure 2: Content (Mis-)Match prevalence in the section-categories across 378 Model Cards. Omission means a sub-section was not present in the documentation.
  • Figure 3: Completeness of the content of Dataset Card sections. Percentage values rounded-up to IEC 60559 standard for presentation.
  • Figure 4: Content (Mis-)Match prevalence in the sub-sections of 321 Dataset Cards. Percentage values rounded-up to IEC 60559 standard for presentation.