Auditing Fairness by Betting

Ben Chugg, Santiago Cortes-Gomez, Bryan Wilder, Aaditya Ramdas

TL;DR

The paper addresses the challenge of auditing fairness for deployed classifiers and regressors under distribution shift and nonuniform data collection. It introduces a nonparametric, sequential auditing framework based on anytime-valid inference and a betting-based testing paradigm (Testing by Betting) that yields a level-$\alpha$ sequential test for group fairness. The authors extend the core method to handle time-varying data collection policies, unknown data densities, distribution drift, and composite nulls, providing finite-sample stopping-time bounds and asymptotic power 1. Empirical evaluations on credit default, census, and health insurance datasets show faster, more reliable detection of unfairness than fixed-time permutation tests, including under drift and policy changes. The approach is practical, interpretable, and accompanied by open-source code for practitioners to deploy in real-world auditing tasks.

Abstract

We provide practical, efficient, and nonparametric methods for auditing the fairness of deployed classification and regression models. Whereas previous work relies on a fixed sample size, our methods are sequential and allow for the continuous monitoring of incoming data, making them highly amenable to tracking the fairness of real-world systems. We also allow the data to be collected by a probabilistic policy as opposed to sampled uniformly from the population. This enables auditing to be conducted on data gathered for another purpose. Moreover, this policy may change over time and different policies may be used on different subpopulations. Finally, our methods can handle distribution shift resulting from either changes to the model or changes in the underlying population. Our approach is based on recent progress in anytime-valid inference and game-theoretic statistics, in particular the "testing by betting" framework. These connections ensure that our methods are interpretable, fast, and easy to implement. We demonstrate the efficacy of our approach on three benchmark fairness datasets.

Paper Structure

This paper contains 24 sections, 4 theorems, 55 equations, 5 figures, and 1 algorithm.

Key Result

Proposition 1

Algorithm 1 (testing by betting), run with input $\alpha\in(0,1)$ and the i.i.d. betting strategy, is a level-$\alpha$ sequential test with asymptotic power one. Moreover, letting $\Delta = |\mu_0-\mu_1|$, under the alternative the expected stopping time $\tau$ obeys a finite-sample upper bound that depends on $\Delta$.
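To make the idea of a level-$\alpha$ sequential test by betting concrete, here is a minimal Python sketch. It is not the paper's algorithm or betting strategy: the function name `sequential_fairness_test`, the pairing of one output per group at each step, and the clipped running-mean bet are all assumptions chosen for illustration. The sketch only shows the general mechanism: start with wealth 1, bet on the sign of the gap between group outputs using past data only, and reject the null of equal means once wealth reaches $1/\alpha$.

```python
def sequential_fairness_test(pairs, alpha=0.05, max_bet=0.5):
    """Illustrative testing-by-betting loop (hypothetical helper, not the paper's code).

    pairs: iterable of (y0, y1), model outputs in [0, 1] for one individual
           from group 0 and one from group 1 at each time step.
    Returns (rejected, stopping_time).
    """
    wealth = 1.0      # the bettor starts with one unit of wealth
    diff_sum = 0.0    # running sum of past differences y0 - y1
    t = 0
    for t, (y0, y1) in enumerate(pairs, start=1):
        # The bet depends only on past observations, so under the null of
        # equal group means the wealth process is a nonnegative martingale.
        mean_diff = diff_sum / (t - 1) if t > 1 else 0.0
        bet = max(-max_bet, min(max_bet, mean_diff))  # assumed betting rule
        d = y0 - y1                                   # lies in [-1, 1]
        wealth *= 1.0 + bet * d                       # stays positive: |bet * d| <= 0.5
        # Ville's inequality: under the null, wealth ever reaching 1/alpha
        # has probability at most alpha.
        if wealth >= 1.0 / alpha:
            return True, t
        diff_sum += d
    return False, t


# A maximally unfair stream: group 0 always gets 0, group 1 always gets 1.
print(sequential_fairness_test([(0.0, 1.0)] * 50))  # (True, 9)
```

Because the rejection threshold is checked after every observation, the test can be stopped as soon as the evidence is strong enough, which is what distinguishes it from a fixed-sample test.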

Figures (5)

  • Figure 1: Testing group fairness by betting
  • Figure 2: Two illustrations of our sequential test adapting to distribution shift. In both settings, the observations at time $t$ are Bernoulli with bias determined by the respective mean at that time. Shaded areas in the bottom plots represent the standard deviation across 100 trials. Left: For the first 100 time steps, we have $\mu_0(t)=\mu_1(t)=0.3$. At time $100$, $\mu_1(t)$ begins to smoothly slope upward. Right: Here we assume that both means are non-stationary and non-smooth. Both are sinusoidal with Gaussian noise but $\mu_1(t)$ drifts slowly upwards over time.
  • Figure 3: Comparisons of false positive rates (FPRs) and stopping times on credit loan data and US census data. The left two columns plot $\tau$ under $H_1$ versus the FPR as $\alpha$ is varied from 0.1 to 0.01. The FPR is grossly inflated under method M1, as illustrated in the first and third columns. Betting is a Pareto improvement over the permutation tests.
  • Figure 4: Response of the tests to distribution shift on the census data. We use a fair model for 400 timesteps, after which we switch to an unfair model with $\Delta\approx 1.0$. The leftmost plot uses permutation tests under M1, resulting in inflated type-I error. Values are plotted as $\alpha$ ranges from 0.01 to 0.1.
  • Figure 5: Illustration of our betting method when using various data collection strategies, $\pi_1$, $\pi_2$, and $\pi_3$, which were based on each individual's region (NE, NW, SE, SW). Operationally, $\pi_i$ samples an individual by first sampling a region with the given probability, and then sampling an individual uniformly at random from that region. We compare results against our method with data sampled uniformly from the population (red crosses), and to permutation tests (M2), also with uniformly sampled data. Even with randomized policies, we continue to outperform permutation tests.
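
The two-stage sampling described in the Figure 5 caption (pick a region with a given probability, then pick an individual uniformly within that region) can be sketched as below. The function names `sample_by_region` and `selection_probability`, the dictionary layout, and the example probabilities are hypothetical; the actual policies $\pi_1$, $\pi_2$, $\pi_3$ and their region probabilities are not given in this summary.

```python
import random


def sample_by_region(population, region_probs, rng):
    """Two-stage sampling: region first, then an individual uniformly within it.

    population:   dict mapping region name -> list of individuals (assumed layout)
    region_probs: dict mapping region name -> sampling probability
    rng:          a random.Random instance, passed in for reproducibility
    """
    regions = list(region_probs)
    weights = [region_probs[r] for r in regions]
    region = rng.choices(regions, weights=weights, k=1)[0]
    individual = rng.choice(population[region])
    return region, individual


def selection_probability(population, region_probs, region):
    """Probability that one specific individual in `region` is drawn."""
    return region_probs[region] / len(population[region])


# Hypothetical population and policy, for illustration only.
population = {"NE": ["a", "b"], "NW": ["c"], "SE": ["d"], "SW": ["e", "f", "g"]}
policy = {"NE": 0.4, "NW": 0.3, "SE": 0.2, "SW": 0.1}
region, person = sample_by_region(population, policy, random.Random(0))
```

Under such a policy, individuals in small or heavily weighted regions are drawn more often than under uniform sampling, which is why an audit needs to know (or estimate) the selection probabilities.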

Theorems & Definitions (5)

  • Definition 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4