Evolved Sample Weights for Bias Mitigation: Effectiveness Depends on Optimization Objectives

Anil K. Saini, Jose Guadalupe Hernandez, Emily F. Wong, Debanshi Misra, Jason H. Moore

TL;DR

This work tackles algorithmic bias in real-world ML by evaluating reweighting as a pre-processing bias-mitigation technique. It systematically compares Equal Weights, Deterministic Weights, and Evolved Weights (EW) across 11 publicly available datasets; EW uses a Genetic Algorithm (NSGA-II) to optimize jointly for predictive performance and fairness. The study finds that EW often yields higher Pareto-front hypervolume than the two baselines, but the extent of improvement depends strongly on the selected objective pair: accuracy ($ACC$) or area under the ROC curve ($AUROC$), combined with demographic parity difference ($DPD$) or subgroup false negative fairness ($SFN$). This underscores the importance of aligning optimization criteria with the desired fairness–performance outcomes and demonstrates the potential of evolutionary reweighting for bias mitigation in high-stakes domains such as healthcare.

Abstract

Machine learning models trained on real-world data may inadvertently make biased predictions that negatively impact marginalized communities. Reweighting is a method that can mitigate such bias in model predictions by assigning a weight to each data point used during model training. In this paper, we compare three methods for generating these weights: (1) evolving them using a Genetic Algorithm (GA), (2) computing them using only dataset characteristics, and (3) assigning equal weights to all data points. Model performance under each strategy was evaluated using paired predictive and fairness metrics, which also served as optimization objectives for the GA during evolution. Specifically, we used two predictive metrics (accuracy and area under the Receiver Operating Characteristic curve) and two fairness metrics (demographic parity difference and subgroup false negative fairness). Using experiments on eleven publicly available datasets (including two medical datasets), we show that evolved sample weights can produce models that achieve better trade-offs between fairness and predictive performance than alternative weighting methods. However, the magnitude of these benefits depends strongly on the choice of optimization objectives. Our experiments reveal that optimizing with accuracy and demographic parity difference metrics yields the largest number of datasets for which evolved weights are significantly better than other weighting strategies in optimizing both objectives.
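
To make the idea of a paired objective concrete, the following is a minimal sketch of one predictive metric (accuracy) and one fairness metric (demographic parity difference) computed on the same predictions. It assumes the common definition of demographic parity difference as the largest gap in positive-prediction rate between groups; the paper's exact formulation may differ, and the toy labels, predictions, and group names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate between any two groups.

    Assumed definition: max group selection rate minus min group selection rate.
    """
    counts = {}  # group -> (number of positive predictions, group size)
    for p, g in zip(y_pred, groups):
        pos, total = counts.get(g, (0, 0))
        counts[g] = (pos + (p == 1), total + 1)
    rates = [pos / total for pos, total in counts.values()]
    return max(rates) - min(rates)


# Hypothetical toy data: 8 samples split evenly across groups "a" and "b".
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

acc = accuracy(y_true, y_pred)                        # 6 of 8 correct
dpd = demographic_parity_difference(y_pred, groups)   # 0.75 vs 0.25 selection rate
print(f"accuracy={acc:.2f}, DPD={dpd:.2f}")
```

A model trained under a given weighting would be scored on both values at once, so each set of weights corresponds to one point in a two-dimensional objective space (higher accuracy is better, lower demographic parity difference is better).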

Paper Structure

This paper contains 18 sections, 3 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: An example of two different decision boundaries producing the same overall accuracy during training on a given dataset. The decision boundary on the left makes less fair predictions than the decision boundary on the right.
  • Figure 2: Hypervolume for a given Pareto front composed of three solution performances: $p_1$, $p_2$, and $p_3$. The reference point ($r$) is used to calculate the hypervolume (shaded region) of the front.
  • Figure 3: Hypervolume for the Pareto fronts constructed from the performance on the test data for various datasets. The figure shows the experimental condition where accuracy and demographic parity difference have been used as the predictive and fairness metrics, respectively. Each point in the figure represents a single run.
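
The hypervolume described in the Figure 2 caption can be sketched for the two-objective case as follows. This is an illustrative sketch, not the authors' implementation: it assumes both objectives are expressed as quantities to minimize (for example, an error rate and a fairness gap), and the coordinates chosen for the three solutions and the reference point r are hypothetical.

```python
def pareto_front(points):
    """Keep the non-dominated points, assuming both objectives are minimized."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(front, ref):
    """Area dominated by the front and bounded by the reference point `ref`."""
    area = 0.0
    prev_y = ref[1]
    for x, y in sorted(front):  # ascending in the first objective
        if x >= ref[0] or y >= prev_y:
            continue  # outside the reference box, or adds no new area
        area += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return area


# Hypothetical solutions as (error, fairness gap); (0.3, 0.3) is dominated.
p1, p2, p3 = (0.1, 0.4), (0.2, 0.2), (0.4, 0.1)
candidates = [p1, p2, p3, (0.3, 0.3)]
r = (0.5, 0.5)  # reference point, worse than every solution in both objectives

front = pareto_front(candidates)
hv = hypervolume_2d(front, r)
print(front, round(hv, 4))
```

A larger hypervolume means the front covers more of the region between the solutions and the reference point, which is why it can serve as a single number for comparing the fronts produced by different weighting strategies.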