Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness

Stephen R. Pfohl, Natalie Harris, Chirag Nagpal, David Madras, Vishwali Mhasawade, Olawale Salaudeen, Awa Dieng, Shannon Sequeira, Santiago Arciniegas, Lillian Sung, Nnamdi Ezeanochie, Heather Cole-Lewis, Katherine Heller, Sanmi Koyejo, Alexander D'Amour

TL;DR

The paper examines the pitfalls of relying on disaggregated subgroup evaluations to assess algorithmic fairness, particularly under distribution shift and selection bias. Modeling data-generating processes as causal DAGs, it derives how Bayes-optimal predictors behave across subgroups under various shifts, identifying when fairness notions such as sufficiency hold and when equal performance across subgroups cannot be expected. It then proposes controlled evaluations that weight population performance to match each subgroup's distribution of a specific variable, which serve as tests of conditional independence, and validates these ideas through simulations and experiments on ACS PUMS data. The work offers practical guidance for combining disaggregated metrics with causal assumptions and distribution-shift-aware analyses to assess and promote fairness in deployed systems more reliably.
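
For readers unfamiliar with the term, sufficiency is commonly formalized as conditional independence of the outcome and subgroup membership given the model output. The statement below is the standard textbook form for a binary outcome, not a formula quoted from the paper. It assumes $Y$ is the outcome, $A$ is subgroup membership, and $R$ is the model score, which is how these symbols appear to be used in the figure captions.

$$
Y \perp A \mid R
\quad\Longleftrightarrow\quad
\mathbb{E}[Y \mid R = r, A = a] = \mathbb{E}[Y \mid R = r] \quad \text{for all } a \text{ and } r .
$$

Under this reading, sufficiency says that a given score carries the same meaning in every subgroup, which is why it is often checked with per-subgroup calibration curves.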

Abstract

Disaggregated evaluation across subgroups is critical for assessing the fairness of machine learning models, but its uncritical use can mislead practitioners. We show that equal performance across subgroups is an unreliable measure of fairness when data are representative of the relevant populations but reflective of real-world disparities. Furthermore, when data are not representative due to selection bias, both disaggregated evaluation and alternative approaches based on conditional independence testing may be invalid without explicit assumptions regarding the bias mechanism. We use causal graphical models to characterize fairness properties and metric stability across subgroups under different data generating processes. Our framework suggests complementing disaggregated evaluations with explicit causal assumptions and analysis to control for confounding and distribution shift, including conditional independence testing and weighted performance estimation. These findings have broad implications for how practitioners design and interpret model assessments given the ubiquity of disaggregated evaluation.
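
The weighted performance estimation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `log_loss_per_example` and `controlled_difference` are hypothetical, the conditioning variable is assumed to be discrete (for example the binary label $Y$), and the sign convention (subgroup performance minus weighted population performance) is an assumption. The weights are the ratio of the variable's distribution in the subgroup to its distribution in the population, so that the reweighted population matches the subgroup on that variable.

```python
import numpy as np


def log_loss_per_example(y, r, eps=1e-12):
    """Per-example log loss for binary labels y and predicted probabilities r."""
    r = np.clip(r, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(r) + (1 - y) * np.log(1 - r))


def controlled_difference(loss, a, v, group):
    """Subgroup mean loss minus population mean loss reweighted to match
    the subgroup's distribution of a discrete variable v.

    Hypothetical helper: the sign convention and the restriction to
    discrete v are assumptions made for this sketch.
    """
    in_group = a == group
    subgroup_mean = loss[in_group].mean()

    # Empirical distribution of v in the population and in the subgroup.
    values, inverse = np.unique(v, return_inverse=True)
    p_pop = np.bincount(inverse, minlength=len(values)) / len(v)
    p_grp = np.bincount(inverse[in_group], minlength=len(values)) / in_group.sum()

    # Importance weight for each example: P(v | A = group) / P(v).
    weights = (p_grp / p_pop)[inverse]
    weighted_pop_mean = np.average(loss, weights=weights)
    return subgroup_mean - weighted_pop_mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5000
    a = rng.integers(0, 2, size=n)                  # subgroup indicator
    y = rng.binomial(1, np.where(a == 1, 0.6, 0.3))  # label prevalence differs by subgroup
    r = np.clip(0.45 + 0.3 * (y - 0.5) + rng.normal(0, 0.1, n), 0.01, 0.99)
    loss = log_loss_per_example(y, r)
    for g in (0, 1):
        print(g, controlled_difference(loss, a, y, g))
```

If the loss depends on subgroup membership only through the variable being matched, the difference should be close to zero; a difference that stays far from zero after weighting suggests the gap is not explained by that variable alone. For a continuous variable, the weights would have to be estimated another way, for instance with a density-ratio or classifier-based estimator.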

Paper Structure

This paper contains 24 sections, 9 equations, 41 figures, 5 tables, 2 algorithms.

Figures (41)

  • Figure 1: Controlled evaluation for confounding across subgroups. Plotted are the statistics $T_{a}$ with 95% confidence intervals, corresponding to differences between unweighted disaggregated performance and population performance weighted to match the distribution of $X$, $Y$, or $R$ on the subgroups. The first row corresponds to subgroup-agnostic prediction and the second row to subgroup-aware prediction, where $A$ is used as a covariate.
  • Figure B1: Simulation study: the effect of subgroup-aware prediction on model performance. We report the difference in performance between models that have access to subgroup membership as an additional covariate as compared to those that do not. Plotted are average differences with 95% confidence intervals for each setting and for several performance metrics.
  • Figure B2: Simulation study: calibration curves. Plotted are calibration curves for each subgroup with 95% confidence intervals. The first row corresponds to subgroup-agnostic prediction, the second row to prediction with $A$ as an additional covariate, and the third row to stratified prediction by $A$.
  • Figure B3: Simulation study: calibration with selection bias. Plotted are calibration curves for each subgroup with 95% confidence intervals. Models are fit in the selected population and evaluated in the full population without selection. The first row corresponds to subgroup-agnostic prediction, the second row to prediction with $A$ as an additional covariate, and the third row to stratified prediction by $A$.
  • Figure B4: Simulation study: controlled evaluation of log loss. Plotted are the statistics $T_{a}$ with 95% confidence intervals, corresponding to differences between the unweighted disaggregated performance and the population performance weighted to match the distribution of $X$, $Y$, or $R$ on the subgroups. The first row corresponds to subgroup-agnostic prediction, the second row to prediction with $A$ as an additional covariate, and the third row to stratified prediction by $A$.
  • ...and 36 more figures