Adversarial Bias: Data Poisoning Attacks on Fairness

Eunice Chan, Hanghang Tong

TL;DR

This work investigates fairness vulnerability in ML by formalizing a targeted data-poisoning problem and proving that naive Bayes classifiers can be driven to maximal unfairness with adversarial data. It introduces Proportional Fairness Attack (PFA), a non-differentiable, surrogate-guided framework that incrementally poisons data to increase disparity while preserving overall accuracy; it uses a dynamic sampling strategy and a Continuous Disparity Margin to select which protected group to target. The authors provide theoretical analysis for maximal unfairness and demonstrate, through extensive experiments on German, Drug, and COMPAS datasets across four base models, that PFA outperforms existing methods in degrading fairness metrics such as SPD and EOD, often achieving $SPD=1$ and $EOD=1$ in multiple settings. The results underscore the practical vulnerability of fairness in deployed systems and highlight the need for defenses and monitoring to ensure trustworthy deployment in real-world scenarios.

Abstract

With the growing adoption of AI and machine learning systems in real-world applications, ensuring their fairness has become increasingly critical. The majority of the work in algorithmic fairness focuses on assessing and improving the fairness of machine learning systems. There is relatively little research on fairness vulnerability, i.e., how an AI system's fairness can be intentionally compromised. In this work, we first provide a theoretical analysis demonstrating that a simple adversarial poisoning strategy is sufficient to induce maximally unfair behavior in naive Bayes classifiers. Our key idea is to strategically inject a small fraction of carefully crafted adversarial data points into the training set, biasing the model's decision boundary to disproportionately affect a protected group while preserving generalizable performance. To illustrate the practical effectiveness of our method, we conduct experiments across several benchmark datasets and models. We find that our attack significantly outperforms existing methods in degrading fairness metrics across multiple models and datasets, often achieving substantially higher levels of unfairness with a comparable or only slightly worse impact on accuracy. Notably, our method proves effective on a wide range of models, in contrast to prior work, demonstrating a robust and potent approach to compromising the fairness of machine learning systems.

Paper Structure

This paper contains 18 sections, 9 theorems, 32 equations, 9 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

If a classifier exhibits $\hat{Y}=s$ behavior (i.e., its predictions perfectly align with the sensitive attribute), then both Statistical Parity Difference (SPD) and Equalized Odds Difference (EOD) attain their maximum value of 1.
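
The lemma can be checked numerically on a toy example. The sketch below is illustrative only and is not the authors' code: it assumes SPD is the absolute gap in positive-prediction rates between the two groups, and EOD is the larger of the true-positive-rate gap and the false-positive-rate gap, which is one common convention and may differ from the paper's exact definition. The helper names (`rate`, `spd`, `eod`) and the toy data are hypothetical. It also assumes each group contains both positive and negative ground-truth labels, since otherwise the group-wise rates are undefined.

```python
def rate(preds, mask):
    """Fraction of positive predictions among the rows selected by mask."""
    selected = [p for p, m in zip(preds, mask) if m]
    if not selected:
        raise ValueError("empty group: rate is undefined")
    return sum(selected) / len(selected)


def spd(y_pred, s):
    """Assumed SPD: |P(Y_hat=1 | s=1) - P(Y_hat=1 | s=0)|."""
    return abs(rate(y_pred, [a == 1 for a in s]) - rate(y_pred, [a == 0 for a in s]))


def eod(y_true, y_pred, s):
    """Assumed EOD: max of the TPR gap and the FPR gap between groups."""
    gaps = []
    for label in (1, 0):  # label 1 gives the TPR gap, label 0 the FPR gap
        r1 = rate(y_pred, [a == 1 and y == label for a, y in zip(s, y_true)])
        r0 = rate(y_pred, [a == 0 and y == label for a, y in zip(s, y_true)])
        gaps.append(abs(r1 - r0))
    return max(gaps)


# Toy data: both groups contain positive and negative ground-truth labels.
s = [0, 0, 0, 0, 1, 1, 1, 1]
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = list(s)  # the classifier predicts exactly the sensitive attribute

spd_value = spd(y_pred, s)
eod_value = eod(y_true, y_pred, s)
print(spd_value, eod_value)
```

Under these assumed definitions both values come out to 1: every member of group $s=1$ receives a positive prediction and every member of group $s=0$ a negative one, regardless of the true label, so the positive-prediction, true-positive, and false-positive rates each differ by exactly 1 between the groups.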

Figures (9)

  • Figure 1: Our method consistently occupies a Pareto-dominant front, achieving higher disparity (EOD) and better trade-offs compared to other methods.
  • Figure 2: Our method consistently achieves the most significant fairness degradation compared to other attack methods against the base models over a varying number of poisoned samples.
  • Figure 3: Trade-off comparison between the fairness attack methods for each dataset and base model across all values of epsilon. Our method consistently offers the best balance, achieving notably higher fairness degradation.
  • Figure 4: Comparison of PFA using $\hat{Y}$ versus $Y$ against a Gaussian naive Bayes model over a varying number of poisoned samples. Using $\hat{Y}$ generally results in a stronger attack.
  • Figure 5: Comparison of candidate dataset selection methods against a Gaussian naive Bayes model over a varying number of poisoned samples. Our selection method allows users to select a dataset with their desired trade-offs.
  • ...and 4 more figures

Theorems & Definitions (18)

  • Lemma 1: Maximally Unfair Behavior
  • proof
  • Theorem 1: Inducing Maximally Unfair Behavior in Naive Bayes via Adversarial Score Biasing
  • proof
  • Lemma 2: Prior Balance
  • proof
  • Lemma 3: Group Posterior Divergence
  • proof
  • Lemma 4: Feature-Label Independence
  • proof
  • ...and 8 more