FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation
Xenia Heilmann, Luca Corbucci, Mattia Cerrato, Anna Monreale
TL;DR
This work targets the gap in fairness evaluation for Federated Learning by addressing heterogeneous, client-level biases that are invisible to server-wide metrics. It introduces FeDa4Fair, a library and benchmarking framework to generate bias-heterogeneous FL datasets and four curated benchmarks, enabling rigorous client- and attribute-level fairness assessments using $DD$ and $EOD$. Baseline comparisons with FedAvg and PUFFLE across these datasets reveal that global fairness does not guarantee client-level equity and that current mitigation methods have nuanced, attribute- and value-specific effects. FeDa4Fair thus provides a reproducible, extensible platform to drive fair-FL research, supporting cross-silo and cross-device settings and integration with Flower and the HuggingFace Datasets Hub.
Abstract
Federated Learning (FL) enables collaborative model training across multiple clients without sharing clients' private data. However, the diverse and often conflicting biases present across clients pose significant challenges to model fairness. Current fairness-enhancing FL solutions often fall short, as they typically mitigate biases for a single, usually binary, sensitive attribute, while ignoring the heterogeneous fairness needs that exist in real-world settings. Moreover, these solutions often evaluate unfairness reduction only on the server side, hiding persistent unfairness at the individual client level. To support more robust and reproducible fairness research in FL, we introduce a comprehensive benchmarking framework for fairness-aware FL at both the global and client levels. Our contributions are three-fold: (1) we introduce FeDa4Fair, a library to create tabular datasets tailored to evaluating fair FL methods under heterogeneous client bias; (2) we release four bias-heterogeneous datasets and corresponding benchmarks to compare fairness mitigation methods in a controlled environment; (3) we provide ready-to-use functions for evaluating fairness outcomes for these datasets.
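The claim that server-side evaluation can hide client-level unfairness can be illustrated with a small sketch. This is not FeDa4Fair's API: the function names and the toy data are hypothetical. It assumes that $DD$ denotes demographic disparity, taken here as the largest between-group gap in positive-prediction rate, and that $EOD$ denotes equalized odds difference, taken here as the largest between-group gap in positive-prediction rate conditioned on the true label. The paper's exact definitions may differ.

```python
from collections import defaultdict


def _max_rate_gap(preds_by_group):
    """Largest gap in positive-prediction rate across groups (0.0 if < 2 groups)."""
    if len(preds_by_group) < 2:
        return 0.0
    rates = [sum(p) / len(p) for p in preds_by_group.values()]
    return max(rates) - min(rates)


def demographic_disparity(y_pred, sensitive):
    """Assumed DD: max between-group gap in P(pred = 1 | group)."""
    by_group = defaultdict(list)
    for pred, group in zip(y_pred, sensitive):
        by_group[group].append(pred)
    return _max_rate_gap(by_group)


def equalized_odds_difference(y_true, y_pred, sensitive):
    """Assumed EOD: max over true labels of the gap in P(pred = 1 | label, group)."""
    gaps = []
    for label in (0, 1):
        by_group = defaultdict(list)
        for true, pred, group in zip(y_true, y_pred, sensitive):
            if true == label:
                by_group[group].append(pred)
        gaps.append(_max_rate_gap(by_group))
    return max(gaps)


# Hypothetical toy predictions: two clients with opposite biases.
groups = ["a"] * 4 + ["b"] * 4
clients = {
    "client_A": {"y_pred": [1, 1, 1, 0, 1, 0, 0, 0], "sensitive": groups},  # favors "a"
    "client_B": {"y_pred": [1, 0, 0, 0, 1, 1, 1, 0], "sensitive": groups},  # favors "b"
}

# Client-level view: evaluate each client's predictions separately.
client_dd = {
    name: demographic_disparity(data["y_pred"], data["sensitive"])
    for name, data in clients.items()
}

# Server-side view: pool all predictions before evaluating.
pooled_pred = [p for data in clients.values() for p in data["y_pred"]]
pooled_sens = [s for data in clients.values() for s in data["sensitive"]]
global_dd = demographic_disparity(pooled_pred, pooled_sens)

print(client_dd)
print(global_dd)
```

In this toy setup the two opposite client-level biases cancel when predictions are pooled, so the pooled disparity is zero even though each client sees a disparity of 0.5. Reporting the metric per client and per sensitive attribute avoids this masking effect.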
