Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face

Xinyu Yang, Weixin Liang, James Zou

TL;DR

This study conducts a large-scale empirical analysis of Hugging Face dataset cards to characterize documentation practices and how they relate to dataset usage. Starting from 24,065 datasets and focusing on the 7,433 with non-empty dataset cards, it finds heterogeneous card completion, strong alignment with a community-endorsed five-section structure among popular datasets, and the emergence of Usage content that goes beyond the template. The work combines automated content detection, topic modeling, and human annotation to show that comprehensive documentation, especially in Description, Structure, and Usage, correlates with perceived quality and download activity, while Limitations and Social Impact are often underrepresented. The findings have practical implications for improving the transparency, reproducibility, and accessibility of AI datasets, and provide a foundation for standards and tooling to improve dataset documentation across platforms.

Abstract

Advances in machine learning are closely tied to the creation of datasets. While data documentation is widely recognized as essential to the reliability, reproducibility, and transparency of ML, we lack a systematic empirical understanding of current dataset documentation practices. To shed light on this question, here we take Hugging Face -- one of the largest platforms for sharing and collaborating on ML models and datasets -- as a prominent case study. By analyzing all 7,433 dataset documentation cards on Hugging Face, our investigation provides an overview of the Hugging Face dataset ecosystem and insights into dataset documentation practices, yielding 5 main findings: (1) The dataset card completion rate shows marked heterogeneity correlated with dataset popularity. (2) A granular examination of each section within the dataset card reveals that practitioners seem to prioritize the Dataset Description and Dataset Structure sections, while the Considerations for Using the Data section receives the lowest proportion of content. (3) By analyzing the subsections within each section and using topic modeling to identify key topics, we uncover what is discussed in each section, and underscore significant themes encompassing both technical and social impacts, as well as limitations within the Considerations for Using the Data section. (4) Our findings also highlight the need for improved accessibility and reproducibility of datasets in the Usage sections. (5) In addition, our human annotation evaluation emphasizes the pivotal role of comprehensive dataset content in shaping individuals' perceptions of a dataset card's overall quality. Overall, our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis and underlines the need for more thorough dataset documentation in machine learning research.
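
Finding (2) rests on measuring how much of a card's text falls under each section. The sketch below is one possible way to compute such proportions, not the authors' pipeline. It assumes that a card is Markdown, that top-level sections are marked by level-2 headings ("## "), and that anything outside the five template sections is pooled under "Other". The function name `section_word_shares` and the demo card are hypothetical.

```python
# Section names as they appear in the figure captions of this summary.
TEMPLATE_SECTIONS = (
    "Dataset Description",
    "Dataset Structure",
    "Dataset Creation",
    "Considerations for Using the Data",
    "Additional Information",
)


def section_word_shares(card_markdown):
    """Return each section's share of the card's total word count.

    Assumption: level-2 Markdown headings delimit top-level sections.
    Heading lines themselves are not counted as content. Lines inside
    code fences that start with '#' would be misread as headings; a
    real parser would need to handle that case.
    """
    counts = {name: 0 for name in TEMPLATE_SECTIONS}
    counts["Other"] = 0
    current = "Other"
    for line in card_markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            title = stripped.lstrip("#").strip()
            if level == 2:
                # Unknown section titles (e.g. "Usage") fall into "Other".
                current = title if title in counts else "Other"
            continue
        counts[current] += len(stripped.split())
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}


# Hypothetical card used only to illustrate the call.
demo_card = """# Dataset Card for Demo
## Dataset Description
A small illustrative corpus.
## Dataset Structure
Each row has a text field and a label field.
## Usage
Load it with your preferred tool.
"""

for name, share in section_word_shares(demo_card).items():
    print(f"{name}: {share:.1%}")
```

Pooling unmatched headings under "Other" mirrors the "Other" category in Figure 3, though how the paper assigns text to that category is not stated here.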

Paper Structure (32 sections, 10 figures, 6 tables)

This paper contains 32 sections, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Systematic Analysis of 24,065 Datasets Hosted on Hugging Face. (a) Exponential Growth of Datasets: The Hugging Face platform has seen a remarkable surge in the number of datasets, with the count doubling approximately every 18 weeks. (b) Power Law in Dataset Usage: Dataset downloads on Hugging Face follow a power-law distribution, as indicated by the linear relationship on the log-log plot. The top 82 datasets account for 80% of the total downloads; datasets with documentation dominate the top downloaded datasets. (c) Documentation Associated with Usage: Despite only 30.9% of dataset repositories (7,433 out of 24,065) featuring non-empty dataset cards, these datasets account for an overwhelming 95.0% of total download traffic on the platform.
  • Figure 2: Highly downloaded datasets consistently show better compliance with the community-endorsed documentation structure.
  • Figure 3: Section Length Reflects Practitioner Attention. (a) Popularity Correlates with Documentation Length: The top downloaded dataset cards are longer, indicating that they contain more comprehensive information. (b) Distribution of Word Count Among Top 100 Downloaded Dataset Cards. (c) Section Length Proportions in Top 100 Downloaded Dataset Cards: The Dataset Description and Dataset Structure sections dominate in the top 100 downloaded dataset cards, with proportions of 36.2% and 33.6%, respectively. In contrast, the Considerations for Using the Data section receives the least attention, with a proportion of only 2.1%. (d) Section Length Proportion Changes over Downloads: Section length proportions vary with download count, with Dataset Description and Dataset Structure decreasing, and Additional Information and Other increasing. Notably, the Dataset Creation and Considerations for Using the Data sections receive consistently low emphasis across dataset cards at all download levels.
  • Figure 4: Highlighting the Hugging Face Community's Compliance with Subsection Guidelines. This figure shows subsection filled-out rates within different sections, stratified by download counts. Each section has multiple subsections, with bars representing the filled-out rate of each subsection. Green text indicates filled-out rates above 50%, while red text indicates rates below 50%. Of the 17 subsections within the five sections of the community-endorsed dataset card structure, 14 have filled-out rates above 50%.
  • Figure 5: Key Topics in Considerations for Using the Data through Topic Modeling Analysis. This figure displays the outcomes of the topic modeling assessment on the contents of the (a) Social Impact of Dataset Subsection, (b) Discussion of Biases Subsection, and (c) Other Known Limitations Subsection. Each panel illustrates the human-assigned topic label and representative sentences for each section. Topics are generated by Latent Dirichlet Allocation (LDA).
  • ...and 5 more figures
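
The headline numbers in the Figure 1 caption can be reproduced or interpreted with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the download counts passed to `datasets_for_share` are made-up values, not data from the paper. It shows (i) the weekly growth rate implied by an 18-week doubling time, (ii) the 7,433 / 24,065 share of repositories with non-empty cards, and (iii) how a statement like "the top k datasets account for 80% of downloads" could be computed from a list of per-dataset download counts.

```python
def weekly_growth_rate(doubling_weeks):
    """Constant weekly growth rate implied by a given doubling time."""
    return 2 ** (1 / doubling_weeks) - 1


def datasets_for_share(downloads, share):
    """Smallest number of top datasets whose downloads reach `share` of the total."""
    ordered = sorted(downloads, reverse=True)
    target = share * sum(ordered)
    running = 0
    for k, count in enumerate(ordered, start=1):
        running += count
        if running >= target:
            return k
    return len(ordered)


# (i) Doubling every 18 weeks corresponds to roughly 3.9% growth per week.
rate = weekly_growth_rate(18)
print(f"weekly growth: {rate:.2%}")

# (ii) Share of repositories with non-empty dataset cards (figures from the caption).
documented_share = 7433 / 24065
print(f"documented share: {documented_share:.1%}")

# (iii) Hypothetical, heavily skewed download counts (NOT from the paper).
toy_downloads = [1000, 500, 250, 125, 60, 30, 20, 10, 5]
print("datasets needed for 80% of downloads:", datasets_for_share(toy_downloads, 0.8))
```

With real per-dataset download counts in place of `toy_downloads`, the same function would give the figure the caption reports as 82.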