Lazy Data Practices Harm Fairness Research

Jan Simson; Alessandro Fabris; Christoph Kern

TL;DR

This paper examines how data practices shape fair ML research by analyzing 280 dataset usages across 142 publications, revealing gaps in representation, preprocessing, and transparency. The authors collect and annotate dataset usage details, and case studies on the COMPAS and Bank datasets illustrate how processing choices shift base rates and fairness scores. The findings show underrepresentation of religion and disability as protected attributes, widespread omission of small subpopulations, and opaque documentation of dataset usage that undermines reproducibility. The paper offers actionable recommendations to improve data sourcing, inclusion, and transparent reporting, aiming to strengthen the reliability and relevance of fair ML research.

Abstract

Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field by highlighting shortcomings and proposing recommendations for improvement. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices hinder the reach and reliability of algorithmic fairness findings. We systematically study protected information encoded in tabular datasets and their usage in 280 experiments across 142 publications. Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research. By conducting exemplary analyses on the utilization of prominent datasets, we demonstrate how unreflective data decisions disproportionately affect minority groups, fairness metrics, and resultant model comparisons. Additionally, we identify supplementary factors such as limitations in publicly available data, privacy considerations, and a general lack of awareness, which exacerbate these challenges. To address these issues, we propose a set of recommendations for data usage in fairness research centered on transparency and responsible inclusion. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
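
To make the second concern concrete, the following is a minimal sketch of how two common preprocessing strategies, filtering and aggregation, change the group shares of a protected attribute. The group labels, counts, and helper names (`shares`, `filter_to`, `aggregate`) are hypothetical and purely illustrative; they are not taken from the paper, COMPAS, or any real dataset.

```python
from collections import Counter

# Synthetic toy sample of a protected attribute (NOT real data):
# two large groups and three small ones.
TOY_GROUPS = ["A"] * 50 + ["B"] * 35 + ["C"] * 10 + ["D"] * 4 + ["E"] * 1


def shares(groups):
    """Return each group's share of the (possibly processed) sample."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def filter_to(groups, keep):
    """Filtering: drop every record whose group is not in `keep`."""
    return [g for g in groups if g in keep]


def aggregate(groups, keep, other="Other"):
    """Aggregation: merge every group not in `keep` into one category."""
    return [g if g in keep else other for g in groups]


full = shares(TOY_GROUPS)
filtered = shares(filter_to(TOY_GROUPS, {"A", "B"}))
aggregated = shares(aggregate(TOY_GROUPS, {"A", "B"}))

for name, result in [("full", full), ("filtered", filtered), ("aggregated", aggregated)]:
    print(name, {g: round(s, 3) for g, s in sorted(result.items())})
```

In this toy sample, filtering removes groups C, D, and E entirely and inflates the shares of A and B, while aggregation keeps the records but erases the distinction between the three smaller groups. Both strategies binarise or coarsen the attribute, which is the kind of choice the abstract describes as disproportionately affecting minority groups.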

Paper Structure (16 sections, 2 equations, 9 figures, 2 tables)

Introduction
Methodology
Neglected Identities
Protected Attributes Globally
Who is Missing
Omitted Populations
Opaque Preprocessing
Discussion
Recommendations
Conclusion
Annotations
Corpus selection
Annotation Process
Annotation Instructions
Robustness
...and 1 more section

Figures (9)

Figure 1: There is a large discrepancy between the list of attributes considered protected under international legislation and their availability or usage in datasets. Bar chart displaying the availability (left) and usage (right) of protected attributes in the literature for all categories of protected attributes in Table \ref{tab:sens}. Availability based on a total of $N = 36$ datasets; usage based on a total of $N = 233$ experiments with enough information available to reconstruct (or at least make an educated guess about) protected attribute usage (see Section \ref{sec:opaque}, Opaque Preprocessing, regarding a lack of available information).
Figure 2: Data from smaller populations is almost always either discarded or aggregated within the annotated literature. (A) Prevalence of processing strategies for the COMPAS dataset within the annotated literature and (B) resulting base rates of the protected attribute from these different processing strategies. Due to the small sample sizes, the populations of Asians and Native Americans are difficult or impossible to see in the figure. Neither group is included as a category in any of the processing strategies except when using the Full Data ($n=1$). Processing strategies binarising protected attributes (i.e. leaving a binary variable with only two groups) are highlighted with a black outline in A. The inner circle corresponds to the combined prevalence of processing strategies using a specific approach (e.g. filtering or aggregation).
Figure 3: A large section of the annotated literature lacks sufficient information to reproduce analyses. Bar diagrams showing whether publications in the annotated literature contain (A) sufficient information to reconstruct usage of the predicted target variables $y$, the protected features $S$ and the features used for prediction $X$ and (B) source code to reproduce analyses. Only publications containing a prediction task are included in the figure.
Figure 4: The "same" dataset is used in many different ways within the literature. Sankey diagram illustrating the usage of the Bank dataset within the annotated literature. Each split corresponds to a choice where differences were observed in the literature. Each unique combination of choices or scenario is identified by a unique letter, with the base rates of the protected attribute(s) displayed on the right. We constructed this figure to provide a conservative, lower-bound estimate regarding the variation in dataset usage.
Figure 5: While a practitioner would choose roughly similar models based on performance across the different scenarios, they would choose very different ones based on fairness. Spearman's $\rho$ correlations of model ranks on a measure of fairness (Equalized Odds Difference) and performance (F1 score) between different scenarios. Letters correspond to scenarios described in Figure 4.
...and 4 more figures
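
The comparison summarised in the Figure 5 caption can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code. It assumes the common definition of Equalized Odds Difference as the larger of the between-group gaps in true positive rate and false positive rate, and it computes Spearman's $\rho$ with the no-ties formula. All labels, predictions, scores, and function names are hypothetical.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels.

    Assumes both classes occur in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return tp / pos, fp / neg


def equalized_odds_difference(y_true, y_pred, group):
    """Larger of the between-group gaps in TPR and FPR (assumed definition)."""
    per_group = []
    for g in sorted(set(group)):
        idx = [i for i, s in enumerate(group) if s == g]
        per_group.append(rates([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    tprs = [r[0] for r in per_group]
    fprs = [r[1] for r in per_group]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


def ranks(values):
    """Rank values from 1 (smallest) upwards; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(a, b):
    """Spearman's rho via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Synthetic predictions for two groups (NOT real data).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = ["a"] * 4 + ["b"] * 4
eod = equalized_odds_difference(y_true, y_pred, group)

# Hypothetical fairness scores of four models under three preprocessing scenarios.
scenario_1 = [0.10, 0.20, 0.30, 0.40]
scenario_2 = [0.40, 0.30, 0.20, 0.10]  # model order fully reversed
scenario_3 = [0.20, 0.10, 0.30, 0.40]  # only the first two models swapped
rho_reversed = spearman_rho(scenario_1, scenario_2)
rho_similar = spearman_rho(scenario_1, scenario_3)
print(eod, rho_reversed, rho_similar)
```

A correlation near 1 between two scenarios means a practitioner would rank the models the same way under both; a correlation near -1 means the preprocessing choice alone reverses the ranking. The caption reports that fairness-based rankings vary far more across scenarios than performance-based rankings do.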
