Multi-Group Fairness Evaluation via Conditional Value-at-Risk Testing

Lucas Monteiro Paes, Ananda Theertha Suresh, Alex Beutel, Flavio P. Calmon, Ahmad Beirami

TL;DR

This work tackles the challenge of auditing ML fairness across many intersectional groups, where max-gap fairness suffers prohibitive sample complexity. It introduces CVaR fairness as a tunable relaxation that concentrates on the tail of group disparities, enabling substantially reduced sample complexity and, under certain non-i.i.d. data collection schemes, independence from the number of groups. The authors derive both achievability and converse results, showing that CVaR-based testing can be near-optimal and that Rényi entropy of order 2/3 characterizes the sample requirements when weighting groups. They provide practical estimators, two sampling strategies, and an algorithm that reliably tests CVaR fairness; numerical experiments demonstrate favorable performance relative to max-gap fairness, especially as the number of groups grows. Together, these results offer scalable, interpretable tools for auditing complex, intersectional fairness in real-world settings, with an explicit trade-off between fairness strength (through α) and data requirements.

Abstract

Machine learning (ML) models used in prediction and classification tasks may display performance disparities across population groups determined by sensitive attributes (e.g., race, sex, age). We consider the problem of evaluating the performance of a fixed ML model across population groups defined by multiple sensitive attributes (e.g., race and sex and age). Here, the sample complexity for estimating the worst-case performance gap across groups (e.g., the largest difference in error rates) increases exponentially with the number of group-denoting sensitive attributes. To address this issue, we propose an approach to test for performance disparities based on Conditional Value-at-Risk (CVaR). By allowing a small probabilistic slack on the groups over which a model has approximately equal performance, we show that the sample complexity required for discovering performance violations is reduced exponentially, to at most the square root of the number of groups. As a byproduct of our analysis, when the groups are weighted by a specific prior distribution, we show that the Rényi entropy of order 2/3 of the prior distribution captures the sample complexity of the proposed CVaR test algorithm. Finally, we also show that there exists a non-i.i.d. data collection strategy that results in a sample complexity independent of the number of groups.
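
To make the contrast between the worst-case gap and its CVaR relaxation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each group's gap is the absolute deviation of its quality of service from the weighted average, and that CVaR is computed with the standard Rockafellar-Uryasev formula $\min_t \{ t + \mathbb{E}[(X - t)_+]/(1-\alpha) \}$. The function names `max_gap` and `cvar_gap` are hypothetical and chosen only for illustration.

```python
import numpy as np


def group_gaps(losses, weights):
    """Absolute deviation of each group's loss from the weighted average.

    Assumption: this is the per-group disparity; the paper's exact
    definition may differ.
    """
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    average = np.dot(weights, losses)
    return np.abs(losses - average)


def max_gap(losses, weights):
    """Worst-case gap over all groups."""
    return float(np.max(group_gaps(losses, weights)))


def cvar_gap(losses, weights, alpha):
    """CVaR at level alpha of the per-group gaps under the group weights.

    Uses min_t { t + E[(X - t)_+] / (1 - alpha) }. For a discrete
    distribution the objective is piecewise linear and convex in t,
    so the minimum is attained at one of the observed gap values.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    weights = np.asarray(weights, dtype=float)
    gaps = group_gaps(losses, weights)
    # excess[i, j] = max(gaps[j] - gaps[i], 0), with candidate t = gaps[i]
    excess = np.maximum(gaps[None, :] - gaps[:, None], 0.0)
    objective = gaps + (excess @ weights) / (1.0 - alpha)
    return float(objective.min())


# Hypothetical example: four equally weighted groups, one with a much higher error rate.
losses = [0.1, 0.1, 0.1, 0.5]
weights = [0.25, 0.25, 0.25, 0.25]
print(max_gap(losses, weights))         # approximately 0.3
print(cvar_gap(losses, weights, 0.0))   # approximately 0.15 (the mean gap)
print(cvar_gap(losses, weights, 0.5))   # approximately 0.2
print(cvar_gap(losses, weights, 0.75))  # approximately 0.3 (matches the max gap here)
```

In this toy example, $\alpha = 0$ recovers the average gap and larger $\alpha$ moves the value toward the worst-case gap, which illustrates the trade-off between fairness strength and data requirements described above.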

Paper Structure

This paper contains 29 sections, 28 theorems, 188 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

(Converse for Max-Gap Fairness). For max-gap fairness metrics $\texttt{F\_MaxGap}$ (e.g., equal opportunity in Example 1 and statistical parity in Example 2) with quality of service function $L$ and measuring the average quality of service using the distribution [...] Furthermore, it is necessary to have access to $n$, given in Eq. (eq:max_gap_lower_bd), i.i.d. samples from [...]

Figures (4)

  • Figure 1: Probability of error in the $\epsilon$-test vs. $\alpha$ for different sample sizes. We use $\epsilon = 0.4 \times \texttt{F\_CVaR}_{\alpha}(\mathbf{w})$ in Algorithm 1; CVaR fairness is computed exactly to evaluate the probability of error. We use $d = 9$ sensitive attributes, leading to $512$ groups, and the group distribution parameter is $p = 0.2$. The probability of error was computed from $1000$ Monte Carlo realizations, and confidence intervals are plotted using the bootstrap from Seaborn (Waskom, 2021) with $95\%$ confidence.
  • Figure 2: False positive rate at a given false negative rate for the hypothesis tests described in the paper. We use $d = 10$ sensitive attributes, generating $1024$ groups, under a data budget of $n = 512$ Bernoulli realizations. We use Approach 3 to distribute the per-group quality of service. The false positive rate was computed from $1000$ Monte Carlo realizations, and confidence intervals are plotted using the bootstrap from Seaborn (Waskom, 2021) with $95\%$ confidence.
  • Figure 3: Area under the false negative vs. false positive curve (AUC) versus the number of samples used to perform the $\epsilon$-test. We vary the entropy by changing the Bernoulli parameter $p$ of the group distribution, with a fixed number of sensitive attributes $d = 10$, generating $1024$ groups, and the data budget on the $x$-axis. We use $100$ samples in the Monte Carlo method and repeat this process $20$ times to estimate the AUC, and we plot confidence intervals using the bootstrap from Seaborn (Waskom, 2021) with $95\%$ confidence.
  • Figure 4: The maximum number of groups $\mathcal{G}$ (z-axis) for each choice of threshold $\epsilon$ (x-axis) and sample size (y-axis) that ensures the probability of error in the hypothesis test from Definition 1 is smaller than $45\%$. Panel (a) shows the maximum number of groups when testing whether max-gap fairness is bigger than $\epsilon$. Panel (b) shows the maximum number of groups when testing whether CVaR fairness is bigger than $\epsilon$ using $\alpha = 0.9$. The maximum number of groups was computed using Eq. (eq:max_gap_max_groups) for max-gap fairness and Eq. (eq:cvar_max_groups) for CVaR fairness.
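
The experimental setup in the captions above (binary sensitive attributes, a Bernoulli parameter $p$ for the group distribution, and entropy varied through $p$) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes each of the $d$ attributes is an independent Bernoulli($p$) draw, so a group with $k$ attributes equal to one has weight $p^k (1-p)^{d-k}$, and it uses the standard Rényi entropy $H_a(\mathbf{w}) = \frac{1}{1-a} \log \sum_g w_g^a$ in nats. The function names are hypothetical.

```python
import math


def product_bernoulli_weights(d, p):
    """Weights of the 2**d groups formed by d independent Bernoulli(p) attributes.

    Assumption: attributes are independent and identically distributed;
    the paper's group distribution may be constructed differently.
    """
    weights = []
    for group in range(2 ** d):
        ones = bin(group).count("1")  # number of attributes equal to 1
        weights.append(p ** ones * (1 - p) ** (d - ones))
    return weights


def renyi_entropy(weights, order):
    """Renyi entropy (in nats) of a discrete distribution, for order > 0, order != 1."""
    if order <= 0 or order == 1:
        raise ValueError("order must be positive and different from 1")
    total = sum(w ** order for w in weights if w > 0)
    return math.log(total) / (1.0 - order)


# d = 9 attributes give 512 groups, as in Figure 1.
weights = product_bernoulli_weights(9, 0.2)
print(len(weights))                         # 512
print(renyi_entropy(weights, 2 / 3))        # entropy for the skewed case p = 0.2
print(renyi_entropy(product_bernoulli_weights(9, 0.5), 2 / 3))  # uniform case, log(512)
```

Under these assumptions, moving $p$ away from $0.5$ lowers the Rényi entropy of order 2/3, which is the quantity the abstract ties to the sample complexity of the CVaR test.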

Theorems & Definitions (38)

  • Definition 1: $\epsilon$-Test
  • Definition 2: Max-Gap Fairness
  • Example 1: Equal Opportunity
  • Example 2: Statistical Parity
  • Proposition 1
  • Definition 3: CVaR Fairness
  • Example 3: CVaR Equal Opportunity
  • Proposition 2
  • Proposition 3
  • Definition 4: Data Budget
  • ...and 28 more