Dataset Representativeness and Downstream Task Fairness

Victor Borza; Andrew Estornell; Chien-Ju Ho; Bradley Malin; Yevgeniy Vorobeychik

TL;DR

This work investigates how dataset representativeness interacts with downstream classifier fairness in multi-site data collection. It introduces a convex, bandit-based framework (PBRS and Distributed PBRS) for constructing representative datasets from heterogeneous sites, and a fair-arm sampling approach for improving minmax fairness during data collection. Using theoretical results for a univariate case and experiments across six real-world datasets, the authors show that improving representativeness does not guarantee fairness and that over-sampling minority groups can sometimes worsen bias. Conversely, increasing model complexity can mitigate unfairness by letting the classifier learn more complex relationships. The findings point to a nuanced trade-off between representativeness and fairness that calls for careful dataset-design and model-design choices in multi-site data collection and downstream decision tasks.

Abstract

Our society collects data on people for a wide range of applications, from building a census for policy evaluation to running meaningful clinical trials. To collect data, we typically sample individuals with the goal of accurately representing a population of interest. However, current sampling processes often collect data opportunistically from data sources, which can lead to datasets that are biased and not representative, i.e., the collected dataset does not accurately reflect the distribution of demographics of the true population. This is a concern because subgroups within the population can be under- or over-represented in a dataset, which may harm generalizability and lead to an unequal distribution of benefits and harms from downstream tasks that use such datasets (e.g., algorithmic bias in medical decision-making algorithms). In this paper, we assess the relationship between dataset representativeness and group-fairness of classifiers trained on that dataset. We demonstrate that there is a natural tension between dataset representativeness and classifier fairness; empirically we observe that training datasets with better representativeness can frequently result in classifiers with higher rates of unfairness. We provide some intuition as to why this occurs via a set of theoretical results in the case of univariate classifiers. We also find that over-sampling underrepresented groups can result in classifiers which exhibit greater bias to those groups. Lastly, we observe that fairness-aware sampling strategies (i.e., those which are specifically designed to select data with high downstream fairness) will often over-sample members of majority groups. These results demonstrate that the relationship between dataset representativeness and downstream classifier fairness is complex; balancing these two quantities requires special care from both model- and dataset-designers.
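
As a rough illustration of what "representative" can mean numerically, the sketch below computes the distance between a cohort's sensitive-feature means and a target vector, the quantity the Figure 1 caption describes. The Euclidean norm, the function name `representativeness_distance`, and the toy cohort are assumptions made for this example; the paper may use a different distance.

```python
import math


def representativeness_distance(cohort, target):
    """Distance between cohort sensitive-feature means and a target vector.

    cohort: list of rows, each a list of 0/1 sensitive-feature indicators.
    target: desired mean for each sensitive feature.
    The Euclidean norm is an assumption made for this sketch.
    """
    if not cohort:
        raise ValueError("cohort must contain at least one individual")
    n = len(cohort)
    # Mean of each sensitive feature across the cohort.
    means = [sum(row[j] for row in cohort) / n for j in range(len(target))]
    return math.sqrt(sum((m - t) ** 2 for m, t in zip(means, target)))


# Hypothetical cohort with two binary sensitive features.
cohort = [[1, 0], [1, 1], [1, 0], [0, 1]]
target = [0.5, 0.5]
distance = representativeness_distance(cohort, target)
print(distance)
```

A distance of zero means the cohort's feature means match the target exactly; larger values indicate a less representative cohort under this particular measure.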

Paper Structure (31 sections, 4 theorems, 14 equations, 32 figures, 1 table, 4 algorithms)

Introduction
Related Work
Preliminaries
Convex Formulation and Prior-Based Sampling
Prior-based Bayesian Representative Sampling (PBRS)
Distributed Prior-based Bayesian Representative Sampling (D-PBRS)
Fair Arm-Based Sampling
Univariate Case Study
Methodology
Datasets
Sampling Procedure and Algorithms
Site Variations
Arm Sampling and Downstream Fairness
Fairness and Complexity Analysis
Experimental Results
...and 16 more sections

Key Result

Theorem 1

The objective in Equation (eq:obj2) is convex with respect to the sample values $\textrm{avg}(\boldsymbol{A}^{(t)})$ and attains the same optimal value as Equation (eq:obj) after all $T$ rounds are completed.

Figures (32)

Figure 1: Dataset representativeness in the Intensive Care dataset, measured by the distance between cohort sensitive feature means and the target vector $\boldsymbol{v} = \langle 0.5, \ldots, 0.5 \rangle$, as the cohort is constructed in the no-bias case (a), for the final cohort in the non-causal response bias case (b), and for the final cohort in the causal distribution shift case (c). Our proposed algorithms BY(H), BY(L), DS(H), and DS(L) outperform baseline sampling algorithms. Shaded regions indicate 95% confidence intervals.
Figure 2: Population (purple) and subgroup (red and blue) AUCs for gradient-boosted classifiers in the Intensive Care dataset. Each column represents an analysis studying group proportions by one sensitive feature: (a), (d), (g) for ethnicity; (b), (e), (h) for age; and (c), (f), (i) for gender. Green points indicate the difference in subgroup AUCs (AUC$_{G_0} -$AUC$_{G_1}$). Circles and shaded regions indicate quantile means and 95% CIs for performance of representativeness-based samplers with varying $G_1$ proportions, while outlined triangles and hexagons with error bars indicate means and 95% CIs for fairness-based samplers. The orange shading indicates the range of group $G_1$ proportions at each site. Subfigures (a-c) show classifier performance when training datasets are constructed by sampling arms with OPT, subfigures (d-f) for sampling arms with D-PBRS, and subfigures (g-i) for sampling directly from all training data to achieve a desired group proportion mix (stratified random sampling).
Figure 3: There is significant unfairness by race (a, d, g), age (b, e, h), and gender (c, f, i) in the Adult Income dataset. Population (purple) and subgroup (red and blue) AUCs (a-c), TPRs (d-f), TNRs (g-i) and 95% CIs are plotted for varying $G_1$ proportions.
Figure 4: Increasing model complexity improves fairness by TPR parity for gender in the Adult Income dataset. Darker green lines indicate a higher maximum tree depth for the GBC (higher complexity), and the x-axis shows the number of estimation steps (more steps indicate higher complexity). Shaded regions indicate 95% CIs.
Figure 5: Increasing model complexity improves AUC (a-c), TPR (d-f), and TNR (g-i) parity for gradient boosted classifiers. Results are for the Adult Income dataset treating gender as the sensitive feature of interest. Darker red and blue colors indicate disparate performance favoring group $G_0$ and $G_1$, respectively, while paler colors indicate measure parity (fairness). Within each subfigure, rows represent maximum individual tree depths and columns indicate numbers of estimation steps.
...and 27 more figures
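
The subgroup AUC difference plotted in Figure 2 (AUC$_{G_0} -$AUC$_{G_1}$) can be sketched in a few lines. This is a minimal illustration and not the authors' evaluation code: the pairwise AUC estimator, the function names `auc` and `subgroup_auc_gap`, and the toy labels, scores, and group indicators are all assumptions made for the example.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((1.0 if p > n else 0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subgroup_auc_gap(labels, scores, groups):
    """Return (AUC_G0, AUC_G1, AUC_G0 - AUC_G1) for a binary group indicator."""
    per_group = []
    for g in (0, 1):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        per_group.append(auc([labels[i] for i in idx], [scores[i] for i in idx]))
    return per_group[0], per_group[1], per_group[0] - per_group[1]


# Hypothetical toy data: the classifier ranks group 0 perfectly, group 1 at chance.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.3, 0.6, 0.7, 0.4, 0.2]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
auc_g0, auc_g1, gap = subgroup_auc_gap(labels, scores, groups)
print(auc_g0, auc_g1, gap)
```

A gap near zero corresponds to AUC parity between the two groups; a positive gap means the classifier ranks members of $G_0$ more accurately than members of $G_1$.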

Theorems & Definitions (11)

Definition 1
Theorem 1
Theorem 2
proof
Theorem 3
proof
Theorem 4
proof
proof: Proof of Theorem \ref{thm:convex}
proof: Proof of Theorem \ref{thm:unfair_1}
...and 1 more
