Fair Bayesian Data Selection via Generalized Discrepancy Measures
Yixuan Zhang, Jiabin Luo, Zhenggang Wang, Feng Zhou, Quyu Kong
TL;DR
Fair-BADS reframes fairness from a model-centric objective to a data-centric one by jointly inferring model parameters and per-instance weights while softly aligning group-specific posteriors toward a shared central distribution. The alignment is implemented via a distributional discrepancy, such as the 2-Wasserstein distance $W_2$, the maximum mean discrepancy $\text{MMD}$, or an $f$-divergence $D_f$, and is optimized with Stein variational gradient descent to enable scalable, particle-based inference across large datasets. The authors provide discrepancy-based transfer bounds and intergroup disparity guarantees, and demonstrate superior fairness-accuracy trade-offs on UTKFace, LFW-A, and FairFace compared with ERM, FairBatch, and prior Bayesian data-selection methods. The approach remains effective even with limited or no meta-data by employing a zero-shot surrogate objective, highlighting its practicality for real-world, large-scale fairness challenges.
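To make the particle-based inference step concrete, the following is a minimal sketch of a generic Stein variational gradient descent (SVGD) update in NumPy. It is not the authors' implementation: the function names `rbf_kernel` and `svgd_step`, the fixed RBF bandwidth `h`, and the step size are all assumptions chosen for illustration, and the fairness alignment term is omitted. In Fair-BADS the particles would represent model parameters and sample weights; here they are plain vectors.

```python
import numpy as np

def rbf_kernel(x, h=1.0):
    """Return the RBF Gram matrix and pairwise differences for particles x of shape (n, d)."""
    diff = x[:, None, :] - x[None, :, :]      # diff[i, j] = x_i - x_j, shape (n, n, d)
    sq_dist = np.sum(diff ** 2, axis=-1)      # squared distances, shape (n, n)
    k = np.exp(-sq_dist / (2.0 * h ** 2))     # k[i, j] = k(x_i, x_j)
    return k, diff

def svgd_step(x, grad_logp, step=0.1, h=1.0):
    """One SVGD update; grad_logp maps (n, d) particles to (n, d) score values."""
    n = x.shape[0]
    k, diff = rbf_kernel(x, h)
    score = grad_logp(x)
    # Attraction: kernel-weighted average of the scores pulls particles toward high density.
    attract = k @ score
    # Repulsion: gradient of the kernel w.r.t. x_j pushes particles apart.
    repulse = np.sum(k[:, :, None] * diff, axis=1) / h ** 2
    return x + step * (attract + repulse) / n

# Hypothetical usage: sample from a standard Gaussian, whose score is -x.
rng = np.random.default_rng(0)
particles = rng.normal(loc=5.0, scale=1.0, size=(30, 1))
for _ in range(500):
    particles = svgd_step(particles, lambda p: -p)
```

The attraction term moves particles toward regions of high posterior density, while the repulsion term keeps them from collapsing onto a single mode, which is what lets a finite particle set approximate a distribution rather than a point estimate.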
Abstract
Fairness concerns are increasingly critical as machine learning models are deployed in high-stakes applications. While existing fairness-aware methods typically intervene at the model level, they often suffer from high computational costs, limited scalability, and poor generalization. To address these challenges, we propose a Bayesian data selection framework that ensures fairness by aligning group-specific posterior distributions of model parameters and sample weights with a shared central distribution. Our framework supports flexible alignment via various distributional discrepancy measures, including Wasserstein distance, maximum mean discrepancy, and $f$-divergence, allowing geometry-aware control without imposing explicit fairness constraints. This data-centric approach mitigates group-specific biases in training data and improves fairness in downstream tasks, with theoretical guarantees. Experiments on benchmark datasets show that our method consistently outperforms existing data selection and model-based fairness methods in both fairness and accuracy.
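As an illustration of one of the discrepancy measures named above, the sketch below estimates the squared maximum mean discrepancy between two particle sets, for example posterior samples for one group and samples from the shared central distribution. The helper names `rbf_gram` and `mmd2`, the RBF kernel, and the fixed bandwidth are assumptions for this example and are not taken from the paper.

```python
import numpy as np

def rbf_gram(a, b, bandwidth=1.0):
    """RBF kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2(a, b, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples a and b."""
    k_aa = rbf_gram(a, a, bandwidth).mean()
    k_bb = rbf_gram(b, b, bandwidth).mean()
    k_ab = rbf_gram(a, b, bandwidth).mean()
    return k_aa + k_bb - 2.0 * k_ab

# Hypothetical usage: a group posterior that is shifted away from the central samples.
rng = np.random.default_rng(0)
central = rng.normal(size=(40, 2))
group = central + 3.0
discrepancy = mmd2(group, central)
```

A larger value indicates that the group-specific samples sit further from the central distribution, so an alignment penalty of this kind shrinks as the group posteriors are pulled together.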
