FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation

Xenia Heilmann, Luca Corbucci, Mattia Cerrato, Anna Monreale

TL;DR

This work targets the gap in fairness evaluation for Federated Learning by addressing heterogeneous, client-level biases that are invisible to server-wide metrics. It introduces FeDa4Fair, a library and benchmarking framework to generate bias-heterogeneous FL datasets and four curated benchmarks, enabling rigorous client- and attribute-level fairness assessments using $DD$ and $EOD$. Baseline comparisons with FedAvg and PUFFLE across these datasets reveal that global fairness does not guarantee client-level equity and that current mitigation methods have nuanced, attribute- and value-specific effects. FeDa4Fair thus provides a reproducible, extensible platform to drive fair-FL research, supporting cross-silo and cross-device settings and integration with Flower and the HuggingFace Datasets Hub.
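
The claim that global fairness does not guarantee client-level equity can be illustrated with a small sketch. It assumes a simplified two-group form of Demographic Disparity (the absolute difference in positive-prediction rates between two groups), which may differ from the exact definition used in the paper. The client names, group labels, and predictions are hypothetical toy data, and the helper functions are not part of the FeDa4Fair API.

```python
from typing import Dict, List, Tuple


def positive_rate(preds: List[int], groups: List[str], group: str) -> float:
    """Share of positive predictions among rows belonging to `group`."""
    selected = [p for p, g in zip(preds, groups) if g == group]
    return sum(selected) / len(selected)


def demographic_disparity(
    preds: List[int], groups: List[str], group_a: str = "a", group_b: str = "b"
) -> float:
    """Assumed simplified DD: |P(pred=1 | group_a) - P(pred=1 | group_b)|."""
    return abs(
        positive_rate(preds, groups, group_a) - positive_rate(preds, groups, group_b)
    )


# Hypothetical toy data: two clients whose biases point in opposite directions.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
clients: Dict[str, Tuple[List[int], List[str]]] = {
    "client_0": ([1, 1, 1, 0, 0, 0, 0, 1], groups),  # favours group "a"
    "client_1": ([0, 0, 0, 1, 1, 1, 1, 0], groups),  # favours group "b"
}

# Client-level view: disparity computed separately on each client's data.
per_client = {
    name: demographic_disparity(preds, grps) for name, (preds, grps) in clients.items()
}

# Server-level view: disparity computed on the pooled predictions.
pooled_preds = [p for preds, _ in clients.values() for p in preds]
pooled_groups = [g for _, grps in clients.values() for g in grps]
pooled_dd = demographic_disparity(pooled_preds, pooled_groups)

print("per-client DD:", per_client)
print("pooled DD:", pooled_dd)
```

In this toy setup each client shows a disparity of 0.5 while the pooled disparity is 0, because the opposite biases cancel when aggregated. This is the kind of effect that a server-side metric alone would hide.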

Abstract

Federated Learning (FL) enables collaborative model training across multiple clients without sharing clients' private data. However, the diverse and often conflicting biases present across clients pose significant challenges to model fairness. Current fairness-enhancing FL solutions often fall short, as they typically mitigate biases for a single, usually binary, sensitive attribute, while ignoring the heterogeneous fairness needs that exist in real-world settings. Moreover, these solutions often evaluate unfairness reduction only on the server side, hiding persistent unfairness at the individual client level. To support more robust and reproducible fairness research in FL, we introduce a comprehensive benchmarking framework for fairness-aware FL at both the global and client levels. Our contributions are three-fold: (1) we introduce FeDa4Fair, a library to create tabular datasets tailored to evaluating fair FL methods under heterogeneous client bias; (2) we release four bias-heterogeneous datasets and corresponding benchmarks to compare fairness mitigation methods in a controlled environment; (3) we provide ready-to-use functions for evaluating fairness outcomes for these datasets.

Paper Structure

This paper contains 17 sections, 22 figures, 6 tables.

Figures (22)

  • Figure 1: A pictorial representation of the FL scenarios tackled by FeDa4Fair. Clients exhibit varying levels of unfairness, here depicted as a high value of Demographic Disparity. FeDa4Fair creates data where fairness metrics reveal inequalities across attribute values (e.g., Black, Asian), across attributes (e.g., race vs. gender), or both.
  • Figure 2: Attribute bias measured with DD on the XGBoost model for attribute benchmark datasets.
  • Figure 3: Attribute value bias measured with DD on XGBoost for value benchmark datasets.
  • Figure 4: Attribute and attribute value bias measured with DD on the true labels and partitioning data from "LA" and "WY". These plots are generated for any dataset created with FeDa4Fair.
  • Figure 5: Attribute bias toward RACE and SEX measured with DD on the XGBoost model vs. the FedAvg model and vs. PUFFLE for the attribute-silo dataset.
  • ...and 17 more figures