
TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation

Wiebke Hutiri, Mircea Cimpoi, Morgan Scheuerman, Victoria Matthews, Alice Xiang

TL;DR

TEDI addresses the gap in empirical analysis of trustworthy and ethical attributes in multimodal dataset documentation by introducing a verifiable 143-indicator framework. The approach converts qualitative disclosures into a machine-auditable rubric and introduces a data-sourcing taxonomy to study how collection methods influence ethics and trustworthiness. Applying TEDI to around 114 datasets with human voices reveals limited documentation of consent, privacy, and harmful content, with documentation more common for crowdsourced and direct collection than for scraped or derived data. This work enables scalable auditing and paves the way for automation and standardized reporting to improve dataset transparency.

Abstract

Dataset transparency is a key enabler of responsible AI, but insights into multimodal dataset attributes that impact trustworthy and ethical aspects of AI applications remain scarce and are difficult to compare across datasets. To address this challenge, we introduce Trustworthy and Ethical Dataset Indicators (TEDI) that facilitate the systematic, empirical analysis of dataset documentation. TEDI encompasses 143 fine-grained indicators that characterize trustworthy and ethical attributes of multimodal datasets and their collection processes. The indicators are framed to extract verifiable information from dataset documentation. Using TEDI, we manually annotated and analyzed over 100 multimodal datasets that include human voices. We further annotated data sourcing, size, and modality details to gain insights into the factors that shape trustworthy and ethical dimensions across datasets. We find that only a select few datasets have documented attributes and practices pertaining to consent, privacy, and harmful content indicators. The extent to which these and other ethical indicators are addressed varies based on the data collection method, with documentation of datasets collected via crowdsourced and direct collection approaches being more likely to mention them. Scraping dominates scale at the cost of ethical indicators, but is not the only viable collection method. Our approach and empirical insights contribute to increasing dataset transparency along trustworthy and ethical dimensions and pave the way for automating the tedious task of extracting information from dataset documentation in future.

Paper Structure

This paper contains 15 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Proportional dataset use per task for datasets from PapersWithCode.
  • Figure 2: Data collection methods influence the size (i.e., recorded hours) of datasets.
  • Figure 3: Proportion of datasets that have considered trustworthy (bottom) and ethical (top) dataset indicators for primary data collection methods.
  • Figure 4: Research approach of this study, including dataset selection, annotation, and analysis.
  • Figure 5: Histograms of detailed data types for video, text, and speech for primary, secondary, and annotation modalities.
  • ...and 5 more figures