Navigating Towards Fairness with Data Selection

Yixuan Zhang, Zhidong Li, Yang Wang, Fang Chen, Xuhui Fan, Feng Zhou

TL;DR

This work tackles fairness in ML under label bias by reframing fairness as a data-selection problem. It introduces a tractable fair data-selection principle that aligns training with a fair data distribution, using a zero-shot predictor as a holdout proxy and a peer-prediction mechanism to guard against bias. The final objective combines current loss, proxy holdout loss, and bias correction, and includes a resampling step to mitigate selection bias, enabling scalability to large datasets. Empirical results on CelebA and LFW+a show improved accuracy and reduced fairness violations across varying label-bias levels, with faster convergence and modality-agnostic applicability to standard log-likelihood or cross-entropy classifiers.

Abstract

Machine learning algorithms often struggle to eliminate inherent data biases, particularly those arising from unreliable labels, which poses a significant challenge in ensuring fairness. Existing fairness techniques that address label bias typically involve modifying models and intervening in the training process, but these lack flexibility for large-scale datasets. To address this limitation, we introduce a data selection method designed to efficiently and flexibly mitigate label bias, tailored to more practical needs. Our approach utilizes a zero-shot predictor as a proxy model that simulates training on a clean holdout set. This strategy, supported by peer predictions, ensures the fairness of the proxy model and eliminates the need for an additional holdout set, which is a common requirement in previous methods. Without altering the classifier's architecture, our modality-agnostic method effectively selects appropriate training data and has proven efficient and effective in handling label bias and improving fairness across diverse datasets in experimental evaluations.
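The selection objective summarized above combines three per-example quantities: the current model's loss, the loss under the zero-shot proxy that stands in for a clean holdout model, and a bias-correction term. The sketch below shows one plausible way such a score could rank a candidate batch, in the spirit of reducible-loss selection. It is an illustration under stated assumptions, not the paper's exact objective: the function names, the sign and weighting of the correction term (`bias_weight`), and the toy numbers are all hypothetical, and the resampling step is not shown.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Per-example cross-entropy given predicted class probabilities."""
    eps = 1e-12  # avoids log(0)
    return -np.log(probs[np.arange(len(labels)), labels] + eps)

def selection_scores(model_probs, proxy_probs, labels, bias_correction, bias_weight=1.0):
    """Hypothetical score: current loss minus proxy holdout loss minus a correction.

    bias_correction is assumed to be a per-example penalty supplied by the caller
    (for instance, derived from peer predictions); its exact form is not taken
    from the paper.
    """
    current_loss = cross_entropy(model_probs, labels)
    proxy_loss = cross_entropy(proxy_probs, labels)
    return current_loss - proxy_loss - bias_weight * bias_correction

def select_top_k(scores, k):
    """Indices of the k highest-scoring examples, best first."""
    return np.argsort(-scores)[:k]

# Toy batch of four examples with two classes (made-up numbers).
model_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
proxy_probs = np.array([[0.8, 0.2], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
labels = np.array([0, 1, 1, 1])
bias_correction = np.zeros(4)  # no correction in this toy example

scores = selection_scores(model_probs, proxy_probs, labels, bias_correction)
chosen = select_top_k(scores, k=2)
```

In this toy batch, the example at index 1 receives the lowest score because the proxy strongly disagrees with its given label, which is the kind of instance a selection rule of this shape would tend to leave out.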

Paper Structure

This paper contains 29 sections, 32 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: Proportion of selected instances affected by label bias under the proposed method (Ours), RHO-LOSS, and uniform sampling. The left plot shows the LFW+a dataset and the right plot shows the CelebA dataset. Overall, the proposed method selects the lowest proportion of label-biased instances.
  • Figure 2: Ablation studies on critical hyperparameters ($\gamma$, $\alpha$, and the selection ratio) on the CelebA dataset with 40% label bias. Blue denotes accuracy (left axis) and purple denotes fairness violation (right axis).