Fairness Improvement with Multiple Protected Attributes: How Far Are We?
Zhenpeng Chen, Jie M. Zhang, Federica Sarro, Mark Harman
TL;DR
This work tackles the problem of fairness in ML when multiple protected attributes are present, challenging the prevalent single-attribute focus. It provides a rigorous empirical evaluation of 11 state-of-the-art fairness-improvement methods across five real-world datasets and four ML models, using 15 fairness-performance measurements. The results reveal substantial cross-attribute trade-offs: improving fairness for one attribute often degrades fairness for unconsidered attributes (in up to 88.3% of scenarios; 57.5% on average). The accuracy loss is similar whether one or multiple attributes are considered, but F1-score and MCC degrade more in multi-attribute settings. The study offers actionable guidance for practitioners, introduces practical baselines, and underscores the need for multi-metric reporting and intersectional evaluation in fairness research.
Abstract
Existing research mostly improves the fairness of Machine Learning (ML) software regarding a single protected attribute at a time, but this is unrealistic given that many users have multiple protected attributes. This paper conducts an extensive study of fairness improvement regarding multiple protected attributes, covering 11 state-of-the-art fairness improvement methods. We analyze the effectiveness of these methods with different datasets, metrics, and ML models when considering multiple protected attributes. The results reveal that improving fairness for a single protected attribute can substantially decrease fairness regarding unconsidered protected attributes. This decrease is observed in up to 88.3% of scenarios (57.5% on average). More surprisingly, we find little difference in accuracy loss when considering single and multiple protected attributes, indicating that accuracy can be maintained in the multiple-attribute paradigm. However, the effect on F1-score when handling two protected attributes is about twice that of handling a single attribute. This has important implications for future fairness research: reporting only accuracy as the ML performance metric, which is currently common in the literature, is inadequate.
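The two effects described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the eight toy records are invented (they are not the paper's data or results), the attribute names `sex` and `race` are placeholders, and statistical parity difference is used as one common fairness metric, which may differ from the metrics used in the study. The `after` predictions stand in for a hypothetical model that was debiased with respect to `sex` only.

```python
# Toy illustration (made-up data, NOT results from the paper).

def statistical_parity_difference(y_pred, group):
    """|P(pred=1 | group=0) - P(pred=1 | group=1)| for a binary attribute."""
    rates = []
    for g in (0, 1):
        preds = [p for p, a in zip(y_pred, group) if a == g]
        rates.append(sum(preds) / len(preds))
    return abs(rates[0] - rates[1])

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    """F1 for the positive class, computed from TP, FP and FN counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical protected attributes and labels for eight individuals.
sex    = [0, 0, 0, 0, 1, 1, 1, 1]
race   = [0, 0, 1, 1, 0, 0, 1, 1]
y_true = [1, 0, 1, 0, 1, 1, 0, 0]

# Hypothetical predictions before and after debiasing for `sex` only.
before = [1, 0, 1, 0, 1, 1, 1, 1]
after  = [1, 1, 0, 0, 1, 1, 0, 0]

for name, preds in (("before", before), ("after", after)):
    print(
        name,
        "SPD(sex)=%.2f" % statistical_parity_difference(preds, sex),
        "SPD(race)=%.2f" % statistical_parity_difference(preds, race),
        "acc=%.2f" % accuracy(y_true, preds),
        "F1=%.2f" % f1_score(y_true, preds),
    )
```

On this toy data, the disparity for `sex` falls from 0.50 to 0.00 while the disparity for the unconsidered attribute `race` rises from 0.00 to 1.00. Accuracy stays at 0.75 in both cases, yet F1 drops from 0.80 to 0.75, which is the kind of change that accuracy-only reporting would hide.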
