Correcting Underrepresentation and Intersectional Bias for Classification

Emily Diana, Alexander Williams Tolbert

TL;DR

This work tackles learning under underrepresentation and intersectional bias by modeling group-specific dropout probabilities and exploiting a two-batch setting that combines a small unbiased sample with a large biased dataset. It introduces a reweighting scheme using estimated bias inverses $\widehat{1/\beta_i}$ and $\widehat{\beta_0}$ to approximate true risk $L_{\mathcal{D}}(h)$ from biased observations, and provides PAC-style guarantees for hypothesis classes with finite VC dimension. The core contributions include a formal bias model, estimators for dropout rates, an ERM-based algorithm operating on biased data, and rigorous sample-complexity bounds that enable efficient agnostic PAC learning in this setting. Empirically, the method achieves population and group accuracies close to those obtained with unbiased data across multiple datasets, illustrating practical impact for fair classification under representation biases.

Abstract

We consider the problem of learning from data corrupted by underrepresentation bias, where positive examples are filtered from the data at different, unknown rates for a fixed number of sensitive groups. We show that with a small amount of unbiased data, we can efficiently estimate the group-wise drop-out rates, even in settings where intersectional group membership makes learning each intersectional rate computationally infeasible. Using these estimates, we construct a reweighting scheme that allows us to approximate the loss of any hypothesis on the true distribution, even if we only observe the empirical error on a biased sample. From this, we present an algorithm encapsulating this learning and reweighting process along with a thorough empirical investigation. Finally, we define a bespoke notion of PAC learnability for the underrepresentation and intersectional bias setting and show that our algorithm permits efficient learning for model classes of finite VC dimension.
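
The estimate-then-reweight idea described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a simplified, non-intersectional model in which a positive example from group $i$ is retained with probability $\beta_i$ and negatives are always retained, so that the biased positive rate $q_i$ and the true positive rate $p_i$ satisfy $q_i = \beta_i p_i / (\beta_i p_i + 1 - p_i)$. The function names (`positive_rate`, `estimate_inverse_beta`, `reweighted_error`) and the toy data are hypothetical and chosen only for illustration.

```python
def positive_rate(sample, group):
    """Fraction of examples in `group` whose label is 1."""
    labels = [y for (_, g, y) in sample if g == group]
    return sum(labels) / len(labels)


def estimate_inverse_beta(unbiased, biased, group):
    """Estimate 1/beta for `group` from positive rates in the two samples.

    Assumed model (a simplification, not taken from the paper): a positive
    from `group` survives filtering with probability beta, negatives always
    survive. Then q = beta*p / (beta*p + 1 - p), which rearranges to
    1/beta = p*(1 - q) / (q*(1 - p)). Undefined if q == 0 or p == 1.
    """
    p = positive_rate(unbiased, group)  # estimate of the true positive rate
    q = positive_rate(biased, group)    # positive rate after filtering
    return p * (1 - q) / (q * (1 - p))


def reweighted_error(predict, biased, inv_beta):
    """0-1 error on the biased sample; positives in group g get weight inv_beta[g]."""
    weighted_mistakes = 0.0
    total_weight = 0.0
    for x, g, y in biased:
        w = inv_beta[g] if y == 1 else 1.0
        total_weight += w
        if predict(x) != y:
            weighted_mistakes += w
    return weighted_mistakes / total_weight


# Toy data as (feature, group, label) triples: one group, true positive rate 0.5.
unbiased = [(0, "a", 1)] * 50 + [(0, "a", 0)] * 50
# Biased sample: half of the positives have been filtered out (beta = 0.5).
biased = [(0, "a", 1)] * 25 + [(0, "a", 0)] * 50

inv_beta = {"a": estimate_inverse_beta(unbiased, biased, "a")}
always_negative = lambda x: 0  # a deliberately poor hypothesis

naive = sum(always_negative(x) != y for x, _, y in biased) / len(biased)
corrected = reweighted_error(always_negative, biased, inv_beta)
print(inv_beta["a"], naive, corrected)
```

On this toy data the unweighted error on the biased sample understates how often the always-negative hypothesis is wrong on the true distribution (about one third versus one half), while the reweighted estimate recovers a value close to one half. The intersectional case treated in the paper requires the estimators it defines, which this sketch does not attempt to reproduce.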

Paper Structure

This paper contains 56 sections, 25 theorems, 71 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Lemma 2.1

For every $I \subseteq [k]$, the positive rate of samples belonging to the intersection of the groups indexed by $I$ can be calculated from the marginal positive rates of those groups and the overall positive rate.

Figures (9)

  • Figure 1: Heatmaps of Pairwise Correlation (Top Row) and $p$-values for $\chi^2$ Test (Bottom Row)
  • Figure 2: Population Accuracy Observed over 100 Seeds for Each Model
  • Figure 3: Group Accuracies on Adult Data Set
  • Figure 4: Group Accuracies on ACS Employment Data Set
  • Figure 5: Reweighting to Approximate $\mathcal{D}$
  • ...and 4 more figures

Theorems & Definitions (54)

  • Lemma 2.1
  • Remark 2.1
  • Definition 2.1: Biased Base Positive Rate for Group $i$
  • Definition 2.2: Biased Base Positive Rate for Population
  • Lemma 2.2
  • Lemma 2.3
  • Remark 2.2
  • Theorem 2.1
  • Definition 2.3: True Loss
  • Remark 2.3
  • ...and 44 more