Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML

Prakhar Ganesh, Usman Gohar, Lu Cheng, Golnoosh Farnadi

TL;DR

This work shows significant variance in fairness achieved by several algorithms and the influence of the learning pipeline on fairness scores, suggesting that the choice of the evaluation parameters, rather than the mitigation technique itself, can sometimes create the perceived superiority of one method over another.

Abstract

With fairness concerns gaining significant attention in Machine Learning (ML), several bias mitigation techniques have been proposed, often compared against each other to find the best method. These benchmarking efforts tend to use a common setup for evaluation under the assumption that providing a uniform environment ensures a fair comparison. However, bias mitigation techniques are sensitive to hyperparameter choices, random seeds, feature selection, etc., meaning that comparison on just one setting can unfairly favour certain algorithms. In this work, we show significant variance in fairness achieved by several algorithms and the influence of the learning pipeline on fairness scores. We highlight that most bias mitigation techniques can achieve comparable performance, given the freedom to perform hyperparameter optimization, suggesting that the choice of the evaluation parameters, rather than the mitigation technique itself, can sometimes create the perceived superiority of one method over another. We hope our work encourages future research on how various choices in the lifecycle of developing an algorithm impact fairness, and on trends that can guide the selection of appropriate algorithms.

Paper Structure

This paper contains 20 sections, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Motivation behind a more nuanced and context-aware benchmarking of bias mitigation techniques, instead of using a uniform evaluation setup or attempting to find the "best" technique.
  • Figure 2: Fairness-utility (demographic parity-accuracy) tradeoff across various settings for the Adult dataset. Each graph represents a different combination of hyperparameters, and each dot in the graph represents a separate training run. Multiple dots for the same mitigation algorithm in the same graph represent runs with changing random seeds and control parameters.
  • Figure 3: Fairness-utility (demographic parity-accuracy) tradeoff across various datasets, under their default hyperparameters. Each dot in the graph represents a separate training run with changing random seeds and control parameters.
  • Figure 4: Pareto front of the fairness-utility (demographic parity-accuracy) tradeoff across various datasets. Each dot in the graph represents a separate training run on the pareto front with changing hyperparameters, random seeds and control parameters.
  • Figure 5: Fairness-utility (equalized odds-accuracy) tradeoff across various settings for the Adult dataset. Each graph represents a different combination of hyperparameters, and each dot in the graph represents a separate training run. Multiple dots for the same mitigation algorithm in the same graph represent runs with changing random seeds and control parameters.
  • ...and 11 more figures
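
The captions above refer to a demographic parity-accuracy tradeoff and to a Pareto front over training runs. As a rough illustration of what those terms mean, the sketch below computes a demographic parity gap for binary predictions and filters a set of runs down to its Pareto front. It is not the authors' code: the function names, the two-group encoding (0 and 1), and the example numbers are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def demographic_parity_gap(y_pred: Sequence[int], group: Sequence[int]) -> float:
    """Absolute difference in positive-prediction rate between groups 0 and 1.

    Assumes binary predictions and a binary sensitive attribute, with at
    least one sample in each group (an illustrative simplification).
    """
    rates = []
    for g in (0, 1):
        preds = [p for p, s in zip(y_pred, group) if s == g]
        rates.append(sum(preds) / len(preds))
    return abs(rates[0] - rates[1])


def pareto_front(runs: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep the runs that no other run dominates.

    Each run is (accuracy, parity_gap). Higher accuracy is better and a
    lower gap is better. A run is dominated if another run is at least as
    good on both axes and strictly better on one.
    """
    front = []
    for i, (acc_i, gap_i) in enumerate(runs):
        dominated = any(
            acc_j >= acc_i and gap_j <= gap_i and (acc_j > acc_i or gap_j < gap_i)
            for j, (acc_j, gap_j) in enumerate(runs)
            if j != i
        )
        if not dominated:
            front.append((acc_i, gap_i))
    return sorted(front)


# Made-up predictions: group 0 gets 1 of 2 positives, group 1 gets 2 of 2.
gap = demographic_parity_gap([1, 0, 1, 1], [0, 0, 1, 1])

# Made-up (accuracy, parity_gap) pairs standing in for separate training runs.
example_runs = [(0.80, 0.05), (0.85, 0.20), (0.78, 0.15), (0.83, 0.10)]
front = pareto_front(example_runs)
```

In this toy data the run (0.78, 0.15) is dropped because (0.80, 0.05) is both more accurate and fairer; the remaining three runs each trade accuracy against the parity gap, which is the kind of comparison the Pareto-front figure is described as showing.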