Fairness Improvement with Multiple Protected Attributes: How Far Are We?

Zhenpeng Chen, Jie M. Zhang, Federica Sarro, Mark Harman

TL;DR

This work tackles the problem of fairness in ML when multiple protected attributes are present, challenging the prevalent single-attribute focus. It provides a rigorous empirical evaluation of 11 state-of-the-art fairness-improvement methods across five real-world datasets and four ML models, using 15 fairness-performance measurements. The results reveal substantial cross-attribute trade-offs: improving fairness for one attribute often degrades fairness for unconsidered attributes (up to 88.3% of scenarios; 57.5% on average), while accuracy remains largely unchanged but F1-score and MCC degrade more in multi-attribute settings. The study offers actionable guidance for practitioners, introduces practical baselines, and underscores the need for multi-metric reporting and intersectional evaluation in fairness research.

Abstract

Existing research mostly improves the fairness of Machine Learning (ML) software regarding a single protected attribute at a time, but this is unrealistic given that many users have multiple protected attributes. This paper conducts an extensive study of fairness improvement regarding multiple protected attributes, covering 11 state-of-the-art fairness improvement methods. We analyze the effectiveness of these methods with different datasets, metrics, and ML models when considering multiple protected attributes. The results reveal that improving fairness for a single protected attribute can largely decrease fairness regarding unconsidered protected attributes. This decrease is observed in up to 88.3% of scenarios (57.5% on average). More surprisingly, we find little difference in accuracy loss when considering single and multiple protected attributes, indicating that accuracy can be maintained in the multiple-attribute paradigm. However, the effect on F1-score when handling two protected attributes is about twice that of a single attribute. This has important implications for future fairness research: reporting only accuracy as the ML performance metric, which is currently common in the literature, is inadequate.
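The cross-attribute effect described above can be illustrated with a toy calculation. The sketch below is not the paper's experimental setup: it assumes statistical parity difference (SPD) as the fairness metric, which the abstract does not name, and it uses hand-made predictions with two hypothetical binary protected attributes, `sex` and `race`. It only shows how predictions that equalize outcomes for one attribute can widen the gap for another.

```python
from typing import Sequence


def statistical_parity_difference(y_pred: Sequence[int], group: Sequence[int]) -> float:
    """Absolute gap in positive-prediction rate between group 0 and group 1.

    SPD is an assumed metric for illustration; 0.0 means equal rates.
    """
    positives = {0: 0, 1: 0}
    counts = {0: 0, 1: 0}
    for y, g in zip(y_pred, group):
        counts[g] += 1
        positives[g] += y
    rate0 = positives[0] / counts[0]
    rate1 = positives[1] / counts[1]
    return abs(rate0 - rate1)


# Hypothetical toy data: eight individuals, two binary protected attributes.
sex = [0, 0, 0, 0, 1, 1, 1, 1]
race = [0, 0, 1, 1, 0, 0, 1, 1]

# Hand-made predictions before and after a (hypothetical) intervention
# that only targets fairness with respect to `sex`.
pred_before = [1, 1, 1, 0, 1, 0, 0, 0]
pred_after = [1, 1, 0, 0, 1, 1, 0, 0]

spd_sex_before = statistical_parity_difference(pred_before, sex)    # 0.5
spd_sex_after = statistical_parity_difference(pred_after, sex)      # 0.0
spd_race_before = statistical_parity_difference(pred_before, race)  # 0.5
spd_race_after = statistical_parity_difference(pred_after, race)    # 1.0

print(f"sex:  SPD {spd_sex_before:.2f} -> {spd_sex_after:.2f}")
print(f"race: SPD {spd_race_before:.2f} -> {spd_race_after:.2f}")
```

In this toy case the gap for the considered attribute (`sex`) drops to zero while the gap for the unconsidered attribute (`race`) doubles, which is why fairness should be checked for every protected attribute and not only for the one being optimized.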

Paper Structure (25 sections, 2 figures, 9 tables)


Figures (2)

  • Figure 1: (RQ3.2) Effectiveness level distributions of existing methods in fairness-performance trade-off when dealing with multiple protected attributes. MAAT, FairMask, and RW achieve the best trade-off, with 81.2%, 80.6%, and 76.9% of cases falling into the win-win or good trade-off, respectively.
  • Figure 2: (RQ4) Effectiveness in intersectional fairness improvement and fairness-performance trade-off of the best three methods identified in this study (i.e., RW, MAAT, and FairMask) across various datasets, models, and measurements. We observe that it is challenging for these methods to achieve a good fairness-performance trade-off for imbalanced datasets and precision-critical applications.