FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods

Xiaotian Han; Jianfeng Chi; Yu Chen; Qifan Wang; Han Zhao; Na Zou; Xia Hu

TL;DR

FFB addresses the lack of standardized benchmarking for in-processing group fairness methods by delivering an extensible, minimalistic, open-source framework with unified pipelines for preprocessing, metrics, and evaluation. It conducts an expansive analysis over 45,079 experiments and 14 datasets spanning tabular, image, and text domains, evaluating 9 group-fairness metrics against 6 utility metrics. The study finds that HSIC generally offers the best utility-fairness balance on tabular data, while adversarial debiasing methods can be unstable and harder to control; it also highlights dataset bias variability and the impact of training dynamics and hyperparameters on fairness outcomes. Overall, FFB provides practical guidance for researchers and practitioners to conduct fair, reproducible evaluations and to extend fairness methods within a cohesive benchmarking ecosystem.

Abstract

This paper introduces the Fair Fairness Benchmark (FFB), a benchmarking framework for in-processing group fairness methods. Ensuring fairness in machine learning is important for ethical compliance. However, comparing and developing fairness methods is difficult because of inconsistent experimental settings, a lack of accessible algorithmic implementations, and the limited extensibility of current fairness packages and tools. To address these issues, we introduce an open-source standardized benchmark for evaluating in-processing group fairness methods and provide a comprehensive analysis of state-of-the-art methods for ensuring different notions of group fairness. This work offers the following key contributions: the provision of flexible, extensible, minimalistic, and research-oriented open-source code; the establishment of unified fairness method benchmarking pipelines; and extensive benchmarking, which yields key insights from 45,079 experiments and 14,428 GPU hours. We believe that our work will significantly facilitate the growth and development of the fairness research community.

Paper Structure (29 sections, 16 figures, 7 tables, 3 algorithms)

Introduction
Why is this Benchmark Needed?
FFB: Fair Fairness Benchmark
Group Fairness Metrics
Benchmarking Datasets
Data Preprocessing
Benchmarking Fairness Methods
Bias Examination for Widely Used Fairness Benchmark Datasets
Benchmarking Current Fairness Methods
How the Bias Mitigating Methods Perform on Utility-Fairness Trade-offs?
Can the Utility-fairness Trade-offs be Controlled?
How do Utility and Fairness Performance Change During Training Process?
Experiment on Text Data
Discussions
Appendix
...and 14 more sections

Figures (16)

Figure 1: The utility-fairness trade-offs of current fairness methods: DiffDP, PRemover, HSIC, LAFTR, and AdvDebias. To plot fairness and utility performance in one figure, for each dataset we normalize the utility (acc, auc) and fairness (abcc, dp) metrics by the performance of ERM, which is therefore the point $(1.0, 1.0)$. The figures clearly show that utility and fairness exhibit trade-offs. These figures are generated from a total of 27,568 runs.
Figure 2: The fairness performance with varying fairness control hyperparameters. The intensity of the color represents the size of the control parameter. In most cases, larger values of the control parameter yield better fairness performance, while smaller values yield worse fairness performance. These figures are generated from 13,110 experimental runs.
Figure 3: The training curves on tabular data. The training curves for fairness metrics typically have a larger standard deviation than those for utility metrics, showing the instability of fairness performance.
Figure 4: The training curves on image data. As with tabular data, the training curves for fairness metrics typically have a larger standard deviation than those for utility metrics.
Figure 5: The utility-fairness trade-offs of current fairness methods (DiffDP, PRemover, and HSIC) on text data. To plot fairness and utility performance in one figure, for each dataset we normalize the utility (acc, auc) and fairness (abcc, dp) metrics by the performance of ERM, which is therefore the point $(1.0, 1.0)$. The figures show that utility and fairness exhibit trade-offs.
...and 11 more figures
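
The ERM normalization described in the captions of Figures 1 and 5 can be sketched roughly as follows. This is only an illustration under stated assumptions: the captions do not give the exact formula, so the sketch assumes a simple ratio to the ERM value, and it assumes dp is the absolute gap in positive-prediction rates between two groups. The function names, toy predictions, and accuracy values are all hypothetical and are not taken from the FFB code base.

```python
def positive_rate(y_pred, sensitive, group):
    """Fraction of positive predictions within one sensitive group."""
    members = [p for p, s in zip(y_pred, sensitive) if s == group]
    if not members:
        raise ValueError(f"no samples for group {group}")
    return sum(members) / len(members)


def demographic_parity_gap(y_pred, sensitive):
    """Assumed dp metric: absolute gap in positive rates between groups 0 and 1."""
    return abs(positive_rate(y_pred, sensitive, 0) - positive_rate(y_pred, sensitive, 1))


def normalize_by_erm(value, erm_value):
    """Assumed normalization: ratio to the ERM baseline, so ERM maps to 1.0."""
    if erm_value == 0:
        raise ValueError("ERM baseline is zero; the ratio is undefined")
    return value / erm_value


# Hypothetical toy data: binary predictions and a binary sensitive attribute.
sensitive = [0, 0, 0, 0, 1, 1, 1, 1]
erm_pred = [1, 1, 1, 0, 1, 0, 0, 0]   # positive rates 0.75 vs 0.25
fair_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # positive rates 0.50 vs 0.25
erm_acc, fair_acc = 0.80, 0.76        # hypothetical accuracies

erm_dp = demographic_parity_gap(erm_pred, sensitive)
fair_dp = demographic_parity_gap(fair_pred, sensitive)

# Each method becomes one (utility, fairness) point relative to ERM.
erm_point = (normalize_by_erm(erm_acc, erm_acc), normalize_by_erm(erm_dp, erm_dp))
fair_point = (normalize_by_erm(fair_acc, erm_acc), normalize_by_erm(fair_dp, erm_dp))
```

Under this reading, a method whose normalized dp falls below 1.0 is less biased than ERM, and a normalized accuracy below 1.0 shows the utility given up in exchange.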