Navigating Towards Fairness with Data Selection
Yixuan Zhang, Zhidong Li, Yang Wang, Fang Chen, Xuhui Fan, Feng Zhou
TL;DR
This work tackles fairness in ML under label bias by reframing fairness as a data-selection problem. It introduces a tractable fair data-selection principle that aligns training with a fair data distribution, using a zero-shot predictor as a proxy for a model trained on a clean holdout set and a peer-prediction mechanism to guard against bias in that proxy. The final objective combines the current loss, the proxy holdout loss, and a bias-correction term, and adds a resampling step to mitigate selection bias; because it only selects data, the approach scales to large datasets. Experiments on CelebA and LFW+a show higher accuracy and fewer fairness violations across varying levels of label bias, along with faster convergence. The method is modality-agnostic and applies to classifiers trained with a standard log-likelihood or cross-entropy loss.
Abstract
Machine learning algorithms often struggle to eliminate inherent data biases, particularly those arising from unreliable labels, which poses a significant challenge in ensuring fairness. Existing fairness techniques that address label bias typically modify the model and intervene in the training process, which makes them inflexible for large-scale datasets. To address this limitation, we introduce a data selection method that mitigates label bias efficiently and flexibly, making it better suited to practical settings. Our approach uses a zero-shot predictor as a proxy model that simulates training on a clean holdout set. This strategy, supported by peer predictions, ensures the fairness of the proxy model and eliminates the need for an additional holdout set, a common requirement of previous methods. Our modality-agnostic method selects appropriate training data without altering the classifier's architecture, and in experimental evaluations across diverse datasets it handles label bias efficiently and improves fairness.
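To make the selection idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the score for each candidate example is its current training loss minus a proxy holdout loss, that the proxy loss is corrected with a peer-loss-style term (the loss on a randomly paired prediction and label), and that the batch is drawn by softmax-weighted resampling instead of a hard top-k. The function names, the weight `alpha`, and the `temperature` parameter are all hypothetical; the summary above does not give the exact form of the objective.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Per-example negative log-likelihood of the given labels."""
    eps = 1e-12  # avoids log(0)
    return -np.log(probs[np.arange(len(labels)), labels] + eps)

def selection_scores(model_probs, proxy_probs, labels, rng, alpha=1.0):
    """Assumed score: current loss minus bias-corrected proxy holdout loss."""
    n = len(labels)
    current = cross_entropy(model_probs, labels)
    proxy = cross_entropy(proxy_probs, labels)
    # Peer term (assumption): proxy loss on independently shuffled
    # predictions and labels, which penalizes blind agreement with
    # possibly biased labels.
    peer_pred = rng.permutation(n)
    peer_label = rng.permutation(n)
    peer = cross_entropy(proxy_probs[peer_pred], labels[peer_label])
    corrected_proxy = proxy - alpha * peer
    return current - corrected_proxy

def select_batch(scores, k, rng, temperature=1.0):
    """Resample k distinct indices with probability softmax(scores / T)."""
    z = scores / temperature
    z = z - z.max()  # numerical stability
    p = np.exp(z)
    p = p / p.sum()
    return rng.choice(len(scores), size=k, replace=False, p=p)
```

In use, `model_probs` would come from the classifier being trained and `proxy_probs` from the zero-shot predictor, both as `(n_examples, n_classes)` arrays of class probabilities. High-scoring examples are those the current model still gets wrong but the corrected proxy considers learnable.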
