From Reflection to Repair: A Scoping Review of Dataset Documentation Tools

Pedro Reynolds-Cuéllar, Marisol Wong-Villacres, Adriana Alvarado Garcia, Heila Precel

TL;DR

This systematic review, supported by mixed-methods analysis of 59 dataset documentation publications, examines the motivations behind building documentation tools, how authors conceptualize documentation practices, and how these tools connect to existing systems, regulations, and cultural norms.

Abstract

Dataset documentation is widely recognized as essential for the responsible development of automated systems. Despite growing efforts to support documentation through different kinds of artifacts, little is known about the motivations shaping documentation tool design or the factors hindering their adoption. We present a systematic review supported by mixed-methods analysis of 59 dataset documentation publications to examine the motivations behind building documentation tools, how authors conceptualize documentation practices, and how these tools connect to existing systems, regulations, and cultural norms. Our analysis shows four persistent patterns in dataset documentation conceptualization that potentially impede adoption and standardization: unclear operationalizations of documentation's value, decontextualized designs, unaddressed labor demands, and a tendency to treat integration as future work. Building on these findings, we propose a shift in Responsible AI tool design toward institutional rather than individual solutions, and outline actions the HCI community can take to enable sustainable documentation practices.

Paper Structure (44 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: Distribution of different types of tools over time based on the sample in our corpus. Note that 2025 is an outlier year because our sample was finalized in March of that year.
  • Figure 2: Distribution over time of different approaches to the production of dataset documentation across the six categories of tools in our sample.
  • Figure 3: Comparative analysis of stakeholder engagement and integration pipelines. It shows the percentage of proposals that specifically mentioned stakeholders as part of the design/use stages compared to the number of proposals from within that group that included any stakeholder in the process. It also shows the percentage of proposals that included stakeholders during the design/testing stages compared to the number of proposals that included concrete features or guidance towards integration to stakeholders.
  • Figure 4: Distribution of approaches to the construction of dataset documentation over the years. Data shows a prevalence of manual approaches with a sustained increase in automated and semi-automated approaches between 2022 and 2025.
  • Figure 5: Comparative analysis of different types of tools that included tool evaluation or features towards integration. Left: Count of proposals that included an evaluation study at any stage of the design or integration of the tool. Right: Percentage of proposals that included stakeholders during the design/testing stages compared to the number of proposals that included concrete features or guidance towards integration to stakeholders.
  • ...and 2 more figures