When Are Learning Biases Equivalent? A Unifying Framework for Fairness, Robustness, and Distribution Shift
Sushant Mehta
TL;DR
The paper addresses the fact that closely related bias phenomena in ML are studied in isolation. It proposes a unifying information-theoretic framework that defines bias as $\mathcal{B}(f; \mathcal{D}) = I(\hat{Y}; A \mid Y)$ and proves formal equivalence conditions connecting spurious correlations, subpopulation shifts, and fairness violations. It gives a concrete corollary linking spurious correlation strength $\alpha$ to an equivalent imbalance ratio $r$. It also shows that, under feature-overlap and smooth-loss assumptions, the worst-group accuracy difference between equivalent problems is bounded by $\delta(\epsilon, \eta) = O(\sqrt{\epsilon}/\eta)$. The authors validate the theory across six datasets and three architectures. They find that predicted equivalences hold to within about 3% in worst-group accuracy and that debiasing methods transfer effectively with minimal retraining. The work offers a principled way to diagnose, compare, and transfer bias-mitigation techniques across the fairness, robustness, and distribution-shift literatures, which could speed practical improvements in deployed ML systems.
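The bias measure $I(\hat{Y}; A \mid Y)$ is zero exactly when predictions carry no information about the attribute $A$ beyond what the label $Y$ already explains. The following is a minimal sketch of how it could be estimated, assuming discrete predictions, attributes, and labels. It uses a simple plug-in (empirical-frequency) estimator, which is biased upward on small samples, and it is not the paper's own estimator. The worst-group accuracy helper assumes groups are defined as $(A, Y)$ pairs, a common convention that the summary does not confirm. Both function names are hypothetical.

```python
from collections import Counter
from math import log


def conditional_mutual_information(y_hat, a, y):
    """Plug-in estimate (in nats) of I(Y_hat; A | Y) from discrete samples.

    Hypothetical helper: uses empirical frequencies, so it is biased
    upward on small samples.
    """
    n = len(y)
    if n == 0 or not (len(y_hat) == len(a) == n):
        raise ValueError("inputs must be non-empty and of equal length")
    c_joint = Counter(zip(y_hat, a, y))  # counts of (y_hat, a, y)
    c_hy = Counter(zip(y_hat, y))        # counts of (y_hat, y)
    c_ay = Counter(zip(a, y))            # counts of (a, y)
    c_y = Counter(y)                     # counts of y
    cmi = 0.0
    for (yh, aa, yy), n_joint in c_joint.items():
        # p(yh, a, y) * log[ p(yh, a | y) / (p(yh | y) p(a | y)) ]
        # simplifies to (n_joint / n) * log[ n_joint * n_y / (n_hy * n_ay) ]
        ratio = n_joint * c_y[yy] / (c_hy[(yh, yy)] * c_ay[(aa, yy)])
        cmi += (n_joint / n) * log(ratio)
    return cmi


def worst_group_accuracy(y_hat, a, y):
    """Minimum accuracy over groups, assuming groups are (a, y) pairs."""
    correct, total = Counter(), Counter()
    for yh, aa, yy in zip(y_hat, a, y):
        total[(aa, yy)] += 1
        correct[(aa, yy)] += int(yh == yy)
    return min(correct[g] / total[g] for g in total)


# Toy data: A is balanced within each label Y.
y = [0, 0, 0, 0, 1, 1, 1, 1]
a = [0, 1, 0, 1, 0, 1, 0, 1]
print(conditional_mutual_information(y, a, y))  # predictor ignores A
print(conditional_mutual_information(a, a, y))  # predictor copies A
```

A predictor that outputs $Y$ itself has zero conditional dependence on $A$ and perfect worst-group accuracy. A predictor that copies $A$ reaches $\log 2$ nats of bias on this toy data and fails completely on the $(A=1, Y=0)$ group.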
Abstract
Machine learning systems exhibit diverse failure modes, such as unfairness toward protected groups, brittleness to spurious correlations, and poor performance on minority subpopulations, that are typically studied in isolation by distinct research communities. We propose a unifying theoretical framework that characterizes when different bias mechanisms produce quantitatively equivalent effects on model performance. By formalizing biases as violations of conditional independence through information-theoretic measures, we prove formal equivalence conditions relating spurious correlations, subpopulation shift, class imbalance, and fairness violations. Our theory predicts that, under feature-overlap assumptions, a spurious correlation of strength $\alpha$ produces the same worst-group accuracy degradation as a subpopulation imbalance ratio $r \approx (1+\alpha)/(1-\alpha)$. Empirical validation across six datasets and three architectures confirms that the predicted equivalences hold to within 3% worst-group accuracy, enabling principled transfer of debiasing methods across problem domains. This work brings the literatures on fairness, robustness, and distribution shift under a common perspective.
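The correspondence $r \approx (1+\alpha)/(1-\alpha)$ can be inverted as $\alpha = (r-1)/(r+1)$. One reading reproduces the formula exactly: suppose a fraction $(1+\alpha)/2$ of samples have the spurious feature aligned with the label and $(1-\alpha)/2$ do not. The majority-to-minority ratio is then $(1+\alpha)/(1-\alpha)$. This interpretation is an assumption, since the summary does not state how $\alpha$ is defined. The sketch below implements only the stated mapping and its inverse; the function names are hypothetical, and the "≈" in the theory means the match is approximate under feature overlap.

```python
def imbalance_ratio_from_alpha(alpha: float) -> float:
    """Map spurious-correlation strength alpha in [0, 1) to r = (1 + alpha) / (1 - alpha)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    return (1.0 + alpha) / (1.0 - alpha)


def alpha_from_imbalance_ratio(r: float) -> float:
    """Inverse map for r >= 1: alpha = (r - 1) / (r + 1)."""
    if r < 1.0:
        raise ValueError("r must be >= 1")
    return (r - 1.0) / (r + 1.0)


for alpha in (0.0, 0.5, 0.9):
    print(alpha, imbalance_ratio_from_alpha(alpha))
```

Under this mapping, $\alpha = 0$ (no spurious correlation) corresponds to a balanced ratio $r = 1$, and $\alpha = 0.5$ corresponds to a 3:1 imbalance. The ratio diverges as $\alpha \to 1$, which is why the domain is restricted to $[0, 1)$.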
