Auditing Fairness by Betting
Ben Chugg, Santiago Cortes-Gomez, Bryan Wilder, Aaditya Ramdas
TL;DR
The paper addresses the challenge of auditing fairness for deployed classifiers and regressors under distribution shift and nonuniform data collection. It introduces a nonparametric, sequential auditing framework based on anytime-valid inference and the "testing by betting" paradigm, which yields a level-$\alpha$ sequential test for group fairness. The authors extend the core method to handle time-varying data collection policies, unknown data densities, distribution drift, and composite nulls, and they provide finite-sample bounds on the stopping time together with asymptotic power one. Empirical evaluations on credit default, census, and health insurance datasets show faster, more reliable detection of unfairness than fixed-time permutation tests, including under drift and policy changes. The approach is practical and interpretable, and it is accompanied by open-source code that practitioners can use in real-world auditing tasks.
Abstract
We provide practical, efficient, and nonparametric methods for auditing the fairness of deployed classification and regression models. Whereas previous work relies on a fixed sample size, our methods are sequential and allow for the continuous monitoring of incoming data, making them highly amenable to tracking the fairness of real-world systems. We also allow the data to be collected by a probabilistic policy as opposed to sampled uniformly from the population. This enables auditing to be conducted on data gathered for another purpose. Moreover, this policy may change over time and different policies may be used on different subpopulations. Finally, our methods can handle distribution shift resulting from either changes to the model or changes in the underlying population. Our approach is based on recent progress in anytime-valid inference and game-theoretic statistics, in particular the "testing by betting" framework. These connections ensure that our methods are interpretable, fast, and easy to implement. We demonstrate the efficacy of our approach on three benchmark fairness datasets.
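To make the "testing by betting" idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that at each step the auditor observes one model output from each of two groups, both scaled to [0, 1], and that the null hypothesis is equal group means. The function name `sequential_fairness_audit`, its parameters, and the clipped gradient rule for updating the bet are all hypothetical choices made for illustration. The auditor starts with wealth 1, bets on the sign of the difference, and rejects the null once wealth reaches $1/\alpha$; by Ville's inequality this happens with probability at most $\alpha$ when the null is true.

```python
import random


def sequential_fairness_audit(pairs, alpha=0.05, step=0.1, max_bet=0.5):
    """Sequential level-alpha test of equal group means by betting.

    pairs: iterable of (score_group0, score_group1), each score in [0, 1].
    Returns (stopping_time, wealth); stopping_time is None if the null
    was never rejected.
    """
    wealth = 1.0
    bet = 0.0  # fixed before each new pair is seen (predictable)
    threshold = 1.0 / alpha
    for t, (score0, score1) in enumerate(pairs, start=1):
        diff = score0 - score1  # lies in [-1, 1]; mean zero under the null
        # Since |bet| <= max_bet < 1, the factor below stays positive.
        wealth *= 1.0 + bet * diff
        if wealth >= threshold:
            return t, wealth  # reject: evidence of a group disparity
        # Illustrative update (an assumption): one gradient-ascent step
        # on log(1 + bet * diff), clipped to [-max_bet, max_bet].
        grad = diff / (1.0 + bet * diff)
        bet = max(-max_bet, min(max_bet, bet + step * grad))
    return None, wealth


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stream in which group 0 receives higher scores on average.
    stream = [(rng.uniform(0.4, 1.0), rng.uniform(0.0, 0.6)) for _ in range(500)]
    print(sequential_fairness_audit(stream))
```

Because the wealth process is a nonnegative martingale under the null, the auditor may check it after every observation and stop at any time without inflating the false-alarm rate, which is what distinguishes this style of test from a fixed-sample permutation test.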
