Fair Bayesian Data Selection via Generalized Discrepancy Measures

Yixuan Zhang, Jiabin Luo, Zhenggang Wang, Feng Zhou, Quyu Kong

TL;DR

Fair-BADS reframes fairness from a model-centric objective to a data-centric one by jointly inferring model parameters and per-instance weights while softly aligning group-specific posteriors toward a shared central distribution. The alignment is implemented via distributional discrepancies such as $W_2$, $\text{MMD}$, or $D_f$, and is optimized with Stein variational gradient descent to enable scalable, particle-based inference across large datasets. The authors provide discrepancy-based transfer bounds and intergroup disparity guarantees, and demonstrate superior fairness-accuracy trade-offs on UTKFace, LFW-A, and FairFace compared with ERM, FairBatch, and prior Bayesian data-selection methods. The approach remains effective even with limited or no meta-data by employing a zero-shot surrogate objective, highlighting its practicality for real-world, large-scale fairness challenges.
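To make the particle-based inference step more concrete, the following is a minimal sketch of a generic Stein variational gradient descent (SVGD) update with an RBF kernel. It is not the authors' implementation: the function names `svgd_step` and `grad_logp`, the fixed bandwidth `h`, and the step size are all assumptions chosen for illustration, and the sketch omits the sample weights and the alignment term that Fair-BADS adds.

```python
import numpy as np

def svgd_step(x, grad_logp, step=0.1, h=1.0):
    """One generic SVGD update on particles x of shape (n, d).

    grad_logp: hypothetical callable returning the score
               grad_x log p(x) for each particle, shape (n, d).
    h:         RBF kernel bandwidth (assumed fixed here).
    """
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]        # diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)        # (n, n)
    K = np.exp(-sq_dist / (2.0 * h ** 2))       # symmetric RBF Gram matrix

    # Attraction: kernel-weighted average of the scores.
    attract = K @ grad_logp(x)                  # (n, d)
    # Repulsion: sum_j grad_{x_j} k(x_j, x_i) = sum_j K[i, j] (x_i - x_j) / h^2.
    repulse = np.sum(K[:, :, None] * diff, axis=1) / h ** 2

    return x + step * (attract + repulse) / n
```

As a toy usage, calling `svgd_step(x, lambda z: -z)` repeatedly moves the particles toward a standard Gaussian target, since `-z` is the score of $\mathcal{N}(0, I)$. The repulsion term is what keeps the particles from collapsing onto the mode.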

Abstract

Fairness concerns are increasingly critical as machine learning models are deployed in high-stakes applications. While existing fairness-aware methods typically intervene at the model level, they often suffer from high computational costs, limited scalability, and poor generalization. To address these challenges, we propose a Bayesian data selection framework that ensures fairness by aligning group-specific posterior distributions of model parameters and sample weights with a shared central distribution. Our framework supports flexible alignment via various distributional discrepancy measures, including Wasserstein distance, maximum mean discrepancy, and $f$-divergence, allowing geometry-aware control without imposing explicit fairness constraints. This data-centric approach mitigates group-specific biases in training data and improves fairness in downstream tasks, with theoretical guarantees. Experiments on benchmark datasets show that our method consistently outperforms existing data selection and model-based fairness methods in both fairness and accuracy.
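As one illustration of the discrepancy measures named above, here is a minimal sketch of a biased empirical estimate of the squared maximum mean discrepancy between two particle sets. The RBF kernel, the `bandwidth` parameter, and the function names are assumptions made for this example; the paper's own kernel choice and estimator may differ.

```python
import numpy as np

def rbf_gram(a, b, bandwidth=1.0):
    """RBF Gram matrix between rows of a (n, d) and b (m, d)."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    sq = np.maximum(sq, 0.0)                    # guard against tiny negatives
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples x and y."""
    k_xx = rbf_gram(x, x, bandwidth).mean()
    k_yy = rbf_gram(y, y, bandwidth).mean()
    k_xy = rbf_gram(x, y, bandwidth).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)
```

In an alignment setting, `x` could hold particles from one group-specific posterior and `y` particles from the central distribution, so that a smaller value indicates closer alignment.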

Paper Structure

This paper contains 35 sections, 4 theorems, 50 equations, 2 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $\tilde{p}^\star$ be the empirical central distribution minimizing the barycenter objective, and define $\bar{R}\triangleq\sum_{s=1}^S \lambda_s R_s(\tilde{p}_s)$. Under (A1), Concretely:

Figures (2)

  • Figure 1: An illustration of Fair-BADS. Fair-BADS jointly infers model parameters and sample weights while reducing bias via posterior alignment to a central distribution.
  • Figure 2: Comparison of sample weight distributions across demographic groups. Left: KDE of sample weights $\mathbf{w}$ at the final training epoch for groups $s=0$ and $s=1$. Right: Wasserstein distance between group-specific weight distributions over training epochs.
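The right panel of Figure 2 tracks a Wasserstein distance between the two groups' weight distributions. A minimal sketch of how such a quantity could be computed for one-dimensional samples is given below. It assumes the order-2 distance and approximates it by comparing empirical quantiles on a common grid; the caption does not state which order or estimator the authors use, and the function name and `n_quantiles` parameter are hypothetical.

```python
import numpy as np

def wasserstein2_1d(u, v, n_quantiles=1000):
    """Approximate W_2 between two 1-D samples via their quantile functions.

    In one dimension, W_2^2 is the integral of the squared difference of
    the two quantile functions; here it is approximated on a midpoint grid,
    which also handles groups of different sizes.
    """
    q = (np.arange(n_quantiles) + 0.5) / n_quantiles
    qu = np.quantile(np.asarray(u, dtype=float), q)
    qv = np.quantile(np.asarray(v, dtype=float), q)
    return float(np.sqrt(np.mean((qu - qv) ** 2)))
```

Evaluating this once per epoch on the sample weights of groups $s=0$ and $s=1$ would give a curve of the kind the caption describes, with values near zero indicating that the two weight distributions have become similar.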

Theorems & Definitions (6)

  • Theorem 1: Discrepancy Transfer Bound
  • Theorem 2: Group Fairness Disparity Bound
  • Remark 1
  • Proposition 1: Divergence preservation under padding
  • Theorem 3: Vanishing effective cross-group term
  • Remark 2