Multi-Group Fairness Evaluation via Conditional Value-at-Risk Testing
Lucas Monteiro Paes, Ananda Theertha Suresh, Alex Beutel, Flavio P. Calmon, Ahmad Beirami
TL;DR
This work tackles the challenge of auditing ML fairness across many intersectional groups, where max-gap fairness suffers from prohibitive sample complexity. It introduces CVaR fairness as a tunable relaxation that concentrates on the tail of group disparities, enabling substantially reduced sample complexity and, under certain non-i.i.d. data collection schemes, a sample complexity independent of the number of groups. The authors derive both achievability and converse results, showing that CVaR-based testing can be near-optimal and that, when groups are weighted by a prior distribution, the Rényi entropy of order 2/3 of that prior characterizes the sample requirements. They provide practical estimators, two sampling strategies, and an algorithm that reliably tests CVaR fairness; numerical experiments demonstrate favorable performance relative to max-gap fairness, especially as the number of groups grows. Together, these results offer scalable, interpretable tools for auditing complex, intersectional fairness in real-world settings, with an explicit trade-off between fairness strength (through α) and data requirements.
Abstract
Machine learning (ML) models used in prediction and classification tasks may display performance disparities across population groups determined by sensitive attributes (e.g., race, sex, age). We consider the problem of evaluating the performance of a fixed ML model across population groups defined by multiple sensitive attributes (e.g., race and sex and age). Here, the sample complexity for estimating the worst-case performance gap across groups (e.g., the largest difference in error rates) increases exponentially with the number of group-denoting sensitive attributes. To address this issue, we propose an approach to test for performance disparities based on Conditional Value-at-Risk (CVaR). By allowing a small probabilistic slack on the groups over which a model has approximately equal performance, we show that the sample complexity required for discovering performance violations is reduced exponentially: it is upper bounded by the square root of the number of groups. As a byproduct of our analysis, when the groups are weighted by a specific prior distribution, we show that the Rényi entropy of order 2/3 of the prior distribution captures the sample complexity of the proposed CVaR test algorithm. Finally, we also show that there exists a non-i.i.d. data collection strategy that results in a sample complexity independent of the number of groups.
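To make the quantities named above concrete, the following is a minimal sketch, not the authors' test algorithm or estimator. It assumes the standard Rockafellar-Uryasev form of CVaR applied to a discrete distribution over per-group gaps, which may differ in detail from the paper's definition. The function names `cvar` and `renyi_entropy`, the gap values, and the uniform prior are all hypothetical and chosen only for illustration.

```python
import math


def cvar(gaps, weights, alpha):
    """CVaR at level alpha of a discrete distribution over per-group gaps.

    Uses the standard Rockafellar-Uryasev form
        min_t { t + E[(X - t)_+] / (1 - alpha) },
    which may differ in detail from the paper's definition (assumption).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    def objective(t):
        tail = sum(w * max(x - t, 0.0) for x, w in zip(gaps, weights))
        return t + tail / (1.0 - alpha)

    # The objective is convex and piecewise linear in t, so its minimum
    # is attained at one of the observed gap values.
    return min(objective(t) for t in gaps)


def renyi_entropy(prior, order=2.0 / 3.0):
    """Rényi entropy (in nats) of a discrete prior; order must not equal 1."""
    return math.log(sum(p ** order for p in prior if p > 0)) / (1.0 - order)


# Hypothetical per-group performance gaps and a uniform prior over 4 groups.
gaps = [0.1, 0.2, 0.3, 0.4]
uniform = [0.25] * 4

max_gap = max(gaps)                    # worst-case (max-gap) statistic
cvar_mean = cvar(gaps, uniform, 0.0)   # alpha = 0 gives the weighted mean gap
cvar_tail = cvar(gaps, uniform, 0.5)   # averages the worst half of the groups
entropy = renyi_entropy(uniform)       # equals log(4) for a uniform prior
```

Under this standard definition, α = 0 recovers the weighted average gap, and raising α toward 1 moves the statistic toward the max-gap value, which is one way to read the trade-off between fairness strength and data requirements described above.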
