A structured regression approach for evaluating model performance across intersectional subgroups

Christine Herlihy, Kimberly Truong, Alexandra Chouldechova, Miroslav Dudik

TL;DR

This work tackles disaggregated evaluation of AI fairness across intersectional subgroups, where standard per-group estimates become unreliable when subgroup sizes are small. It introduces a structured regression (SR) framework that models subgroup performance as $\mu_a = \theta_0 + \boldsymbol{\theta}\cdot\boldsymbol{\phi}^a$ and fits it with an inverse-variance weighted LASSO, using a pooled variance estimate to borrow strength across groups. The result is a set of shrunk estimates $\hat{\mu}_a$ that trade a small amount of bias for lower variance, with calibrated confidence intervals constructed via residual bootstrap (rBLPR). The approach also supports goodness-of-fit testing to distinguish additive from interaction structure and to identify benign factors driving variation. Empirical results on diabetes and automatic speech recognition (ASR) datasets show that SR (and the related estimators JS/EB) substantially improves point estimates and interval coverage over the standard method and MBM, especially for small subgroups, while goodness-of-fit analyses illuminate the factors shaping fairness-related harms. The methodology provides a practical, interpretable tool for reliable disaggregated fairness assessment and data-driven mitigation planning.
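To make the model $\mu_a = \theta_0 + \boldsymbol{\theta}\cdot\boldsymbol{\phi}^a$ concrete, the following is a minimal sketch of an inverse-variance weighted LASSO fitted by cyclic coordinate descent. It is an illustration under stated assumptions, not the authors' implementation: the feature map (one indicator per attribute), the group means, the group sizes, the value of the pooled variance `sigma2`, the weights `n_a / sigma2`, and all function names are hypothetical, and the paper's own feature construction and choice of $\lambda$ are not reproduced here.

```python
def soft_threshold(x, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def fit_weighted_lasso(phi, z, w, lam, n_iter=500):
    """Minimise 0.5 * sum_a w[a] * (z[a] - theta0 - theta . phi[a])**2
    + lam * ||theta||_1 by cyclic coordinate descent.
    The intercept theta0 is not penalised."""
    n, p = len(phi), len(phi[0])
    theta0, theta = 0.0, [0.0] * p
    w_sum = sum(w)

    def linear(a, skip=None):
        # theta . phi[a], optionally leaving out coordinate `skip`
        return sum(theta[k] * phi[a][k] for k in range(p) if k != skip)

    for _ in range(n_iter):
        # Unpenalised intercept: weighted mean of the current residuals.
        theta0 = sum(w[a] * (z[a] - linear(a)) for a in range(n)) / w_sum
        for j in range(p):
            rho = sum(w[a] * phi[a][j] * (z[a] - theta0 - linear(a, skip=j))
                      for a in range(n))
            denom = sum(w[a] * phi[a][j] ** 2 for a in range(n))
            theta[j] = soft_threshold(rho, lam) / denom if denom > 0 else 0.0
    return theta0, theta


def predict(phi, theta0, theta):
    """Structured estimate mu_a = theta0 + theta . phi[a] for every group."""
    return [theta0 + sum(t * x for t, x in zip(theta, row)) for row in phi]


# Hypothetical data: four subgroups from two binary attributes.
phi = [[0, 0], [0, 1], [1, 0], [1, 1]]   # assumed indicator features phi^a
z = [0.10, 0.20, 0.30, 0.40]             # assumed per-group sample means
n_a = [50, 5, 40, 3]                     # assumed group sizes
sigma2 = 0.09                            # assumed pooled variance estimate
w = [n / sigma2 for n in n_a]            # inverse variance of each group mean

# lam = 0 reduces to weighted least squares (no shrinkage).
theta0, theta = fit_weighted_lasso(phi, z, w, lam=0.0)
mu_unshrunk = predict(phi, theta0, theta)

# A very large lam zeroes every coefficient: all groups share one pooled mean.
theta0_big, theta_big = fit_weighted_lasso(phi, z, w, lam=1e6)
mu_pooled = predict(phi, theta0_big, theta_big)
```

Intermediate values of `lam` interpolate between these two extremes, which is where small groups borrow strength from larger ones that share their features.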

Abstract

Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups defined by combinations of demographic or other sensitive attributes. The standard approach is to stratify the evaluation data across subgroups and compute performance metrics separately for each group. However, even for moderately-sized evaluation datasets, sample sizes quickly get small once considering intersectional subgroups, which greatly limits the extent to which intersectional groups are included in analysis. In this work, we introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups. We provide corresponding inference strategies for constructing confidence intervals and explore how goodness-of-fit testing can yield insight into the structure of fairness-related harms experienced by intersectional groups. We evaluate our approach on two publicly available datasets, and several variants of semi-synthetic data. The results show that our method is considerably more accurate than the standard approach, especially for small subgroups, and demonstrate how goodness-of-fit testing helps identify the key factors that drive differences in performance.
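The standard approach described above can be sketched as follows: stratify the evaluation records by every combination of attributes and average a per-example metric within each cell. This is a minimal illustration only; the records, the attribute names `age` and `sex`, the `error` field, and the function name are hypothetical. For simplicity the sketch reports a per-group standard error, whereas the paper's figures use a pooled variance for the standard estimator.

```python
from collections import defaultdict
import math


def standard_disaggregated(records, attrs, value_key):
    """Per-subgroup mean and standard error of a per-example metric."""
    by_group = defaultdict(list)
    for rec in records:
        by_group[tuple(rec[a] for a in attrs)].append(rec[value_key])

    results = {}
    for group, vals in by_group.items():
        n = len(vals)
        mean = sum(vals) / n
        if n > 1:
            var = sum((v - mean) ** 2 for v in vals) / (n - 1)
            se = math.sqrt(var / n)
        else:
            se = None  # a single example gives no variance estimate
        results[group] = {"n": n, "mean": mean, "se": se}
    return results


# Hypothetical evaluation records (error = 1 means the system was wrong).
records = (
    [{"age": "<40", "sex": "F", "error": e} for e in (0, 1, 0, 1)]
    + [{"age": "<40", "sex": "M", "error": e} for e in (0, 0, 0)]
    + [{"age": "40+", "sex": "F", "error": e} for e in (1,)]
    + [{"age": "40+", "sex": "M", "error": e} for e in (0, 1)]
)

results = standard_disaggregated(records, ["age", "sex"], "error")
for group, stats in sorted(results.items()):
    print(group, stats)
```

Even in this toy example one intersectional cell holds a single record, so its estimate has no usable uncertainty; this is the small-sample problem that motivates sharing information across groups.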

Paper Structure

This paper contains 14 sections, 31 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Point estimates and 95% confidence intervals of selection rate (SEL) and false negative rate (FNR) on diabetes data. Confidence intervals of the standard estimator are calculated using pooled variance (see Eqs. \ref{eq:pooled} and \ref{eq:pooled:CI}).
  • Figure 2: Bias-variance trade-off of structured regression estimates of selection rate (SEL) on diabetes data. Averaged across all groups, small groups (size at most 25), and large groups (size above 25), across 100 draws of the evaluation dataset. The scale of the MSE differs by group size, but the minimum MSE is attained around the same value of $\lambda$, thanks to the weighting of the training loss.
  • Figure 3: Mean absolute error of estimates of 6 metrics using 5 methods on diabetes data. Averaged across all groups, small groups (size at most 25), and large groups (size above 25), across 20 draws of the evaluation dataset.
  • Figure 4: Coverage and mean relative width of confidence intervals for 6 metrics constructed by 3 methods on diabetes data. Averaged across all groups and across 20 draws of the evaluation dataset. Relative width is measured with respect to the width of the standard confidence interval.
  • Figure 5: Point estimates and 95% confidence intervals of word error rates of five ASR systems.
  • ...and 3 more figures
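
The two interval diagnostics named in the Figure 4 caption, coverage and mean relative width, can be computed along the following lines. This is a hedged sketch: the intervals and ground-truth values are made up, the function names are hypothetical, and averaging the per-group width ratios (rather than taking a ratio of average widths) is an assumption about how the quantity is aggregated.

```python
def coverage(intervals, truths):
    """Fraction of intervals (lo, hi) that contain the true value."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    return hits / len(truths)


def mean_relative_width(intervals, standard_intervals):
    """Average, over groups, of interval width divided by the width of
    the standard confidence interval for the same group."""
    ratios = [
        (hi - lo) / (s_hi - s_lo)
        for (lo, hi), (s_lo, s_hi) in zip(intervals, standard_intervals)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical values for three subgroups.
truths = [0.10, 0.20, 0.30]
method_cis = [(0.05, 0.15), (0.12, 0.22), (0.31, 0.39)]
standard_cis = [(0.00, 0.20), (0.05, 0.45), (0.10, 0.50)]

cov_method = coverage(method_cis, truths)
cov_standard = coverage(standard_cis, truths)
rel_width = mean_relative_width(method_cis, standard_cis)
```

A relative width below 1 means the method's intervals are narrower than the standard ones, which is only an improvement if coverage stays near the nominal level.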

Theorems & Definitions (2)

  • Example 1: Diabetes
  • Example 2: ASR