Utilizing Adversarial Examples for Bias Mitigation and Accuracy Enhancement

Pushkar Shukla; Dhruv Srikanth; Lee Cohen; Matthew Turk

TL;DR

The paper targets bias in computer vision by introducing Attribute-Specific Adversarial Counterfactuals (ASACs) and a curriculum-based fine-tuning framework that uses ASACs to debias a target classifier while preserving or improving accuracy. It combines adversarial counterfactual generation (via FGSM/PGD) with a two-stage curriculum and an adversarial loss term to guide post-processing debiasing. Across CelebA and LFW, with multiple backbones, the method achieves improvements in fairness metrics such as $DDP$, $DEO$, and $DEOp$ without sacrificing $ACC$, demonstrating robustness and generalization. The work offers a practical, ethics-conscious approach to bias mitigation that minimizes reliance on generator-based counterfactuals and enhances model interpretability through attribution analysis.

Abstract

We propose a novel approach to mitigate biases in computer vision models by utilizing counterfactual generation and fine-tuning. While counterfactuals have been used to analyze and address biases in DNN models, the counterfactuals themselves are often generated from biased generative models, which can introduce additional biases or spurious correlations. To address this issue, we propose using adversarial images, that is, images that deceive a deep neural network but not humans, as counterfactuals for fair model training. Our approach leverages a curriculum learning framework combined with a fine-grained adversarial loss to fine-tune the model using adversarial examples. By incorporating adversarial images into the training data, we aim to prevent biases from propagating through the pipeline. We validate our approach through both qualitative and quantitative assessments, demonstrating improved bias mitigation and accuracy compared to existing methods. Qualitatively, our results indicate that, after training, the decisions made by the model are less dependent on the sensitive attribute and the model better disentangles the relationship between sensitive attributes and classification variables.
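The idea of perturbing an image so that a sensitive-attribute classifier is deceived can be illustrated with a single FGSM step. The following is a minimal sketch under strong assumptions, not the authors' implementation: the attribute classifier is a hypothetical logistic model with weights `w` and bias `b`, the input is a flat feature vector rather than an image, and the names `fgsm_attribute_attack` and `eps` are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_attribute_attack(x, a, w, b, eps=0.05):
    """One FGSM step against a logistic sensitive-attribute classifier.

    x: input features in [0, 1], shape (d,)
    a: sensitive-attribute label (0 or 1)
    w, b: weights and bias of the hypothetical attribute classifier
    eps: L-infinity perturbation budget
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - a) * w                 # gradient of binary cross-entropy w.r.t. x
    x_adv = x + eps * np.sign(grad_x)    # step in the direction that increases the loss
    return np.clip(x_adv, 0.0, 1.0)      # keep the result in the valid input range

rng = np.random.default_rng(0)
w = rng.normal(size=16)
b = 0.0
x = rng.uniform(0.2, 0.8, size=16)
a = int(sigmoid(w @ x + b) >= 0.5)       # use the classifier's own prediction as the label

x_adv = fgsm_attribute_attack(x, a, w, b, eps=0.05)
p_before = sigmoid(w @ x + b)
p_after = sigmoid(w @ x_adv + b)
```

A single step with a small `eps` lowers the classifier's confidence in the original attribute label but is not guaranteed to flip the decision; PGD repeats such steps with projection back into the `eps`-ball.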

Paper Structure (35 sections, 11 equations, 6 figures, 6 tables, 1 algorithm)

Introduction
Related Works
Adversarial examples
Bias mitigation in computer vision
Counterfactuals and their role in machine learning
The Relationship Between Counterfactual and Adversarial Examples
Curriculum Learning
Method
Notations
Generating attribute-specific adversarial examples
Curriculum Learning
Computing Difficulty Scores
Curriculum Assignment
Fine-Grained Control using Adversarial Loss
Results
...and 20 more sections

Figures (6)

Figure 1: An example of gender counterfactuals produced by StyleGAN2 versus our method, for a smile classification task. StyleGAN2's images (right) correlate femininity with darker lipstick and exaggerated smiling, introducing biases. Our approach (left) generates ASACs that retain the same visual appearance as the original image.
Figure 2: Bias Mitigation Strategy: Our proposed solution for mitigating biases in a model (e.g., smile classifier) $M(\theta,\rho)$ involves training a sensitive attribute classifier $C(\theta,\phi)$ (shown in the network architecture). We then follow a three-stage pipeline. (1) We generate ASACs that are capable of deceiving $C(\theta,\phi)$. (2) We define a curriculum assignment strategy that organizes these ASACs based on the degree to which they deceive the original model $M(\theta,\rho)$. (3) We fine-tune the original model $M(\theta,\rho)$ using the organized ASACs and the proposed loss function (see Equation \ref{eq:adv_loss}).
Figure 3: We look at the Integrated Gradients (IG) for samples before training (IG before) and after training (IG after).
Figure 4: Qualitative results showing that our trained model becomes robust to ASACs after training.
Figure 5: Examples where the adversarial noise is unable to flip the smile classifier. As shown in the figure, adding adversarial noise does not change the decision of the smile classifier (red curve). After training, the decision remains unchanged as well.
...and 1 more figure
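The curriculum assignment step described in the Figure 2 caption, which organizes ASACs by how strongly they deceive the target model, could be sketched as follows. This is only an illustration under assumptions: the difficulty score used here (absolute change in the target model's predicted probability) and the even split into stages are placeholders, since the paper's actual score and assignment rule are not given in this summary. The function name `assign_curriculum` is hypothetical.

```python
import numpy as np

def assign_curriculum(p_clean, p_adv, n_stages=2):
    """Order examples from easy to hard and split them into stages.

    p_clean, p_adv: target-model probabilities on original images and their ASACs.
    Difficulty is |p_adv - p_clean| (an assumed score, not the paper's definition).
    Returns a list of index arrays, one per stage, easiest stage first.
    """
    difficulty = np.abs(np.asarray(p_adv) - np.asarray(p_clean))
    order = np.argsort(difficulty, kind="stable")   # ascending: least deceived first
    return np.array_split(order, n_stages)

# Toy probabilities for four image/ASAC pairs (made-up values for illustration).
p_clean = np.array([0.90, 0.80, 0.70, 0.95])
p_adv = np.array([0.85, 0.30, 0.60, 0.10])
stages = assign_curriculum(p_clean, p_adv)
```

Here the first stage holds the pairs whose prediction barely moves, and the second stage holds the pairs where the ASAC shifts the target model's output the most.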

Theorems & Definitions (3)

Definition 1: Equalized Odds
Definition 2: Demographic Parity
Definition 3: Equalized Opportunity
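The three definitions above correspond to the fairness metrics $DDP$, $DEO$, and $DEOp$ reported in the TL;DR. A minimal sketch of how such gaps are commonly computed for binary labels and a binary sensitive attribute is shown below. The paper's formal definitions are not reproduced in this summary, so the aggregation used for $DEO$ (sum of the true-positive-rate and false-positive-rate gaps) is an assumption, and the names `fairness_gaps` and `_rate` are illustrative.

```python
import numpy as np

def _rate(y_pred, mask):
    """Fraction of positive predictions within mask (nan if the mask is empty)."""
    return float(np.mean(y_pred[mask])) if mask.any() else float("nan")

def fairness_gaps(y_true, y_pred, a):
    """Group gaps for binary predictions y_pred, labels y_true, attribute a."""
    y_true, y_pred, a = map(np.asarray, (y_true, y_pred, a))
    # Demographic parity: gap in positive prediction rate between groups.
    ddp = abs(_rate(y_pred, a == 0) - _rate(y_pred, a == 1))
    # Equalized opportunity: gap in true positive rate between groups.
    tpr_gap = abs(_rate(y_pred, (a == 0) & (y_true == 1))
                  - _rate(y_pred, (a == 1) & (y_true == 1)))
    # False positive rate gap, needed for equalized odds.
    fpr_gap = abs(_rate(y_pred, (a == 0) & (y_true == 0))
                  - _rate(y_pred, (a == 1) & (y_true == 0)))
    # Equalized odds: assumed here to be the sum of both gaps.
    return {"DDP": ddp, "DEOp": tpr_gap, "DEO": tpr_gap + fpr_gap}

# Toy data: eight samples, four per sensitive group (made-up values).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
a = [0, 0, 0, 0, 1, 1, 1, 1]
gaps = fairness_gaps(y_true, y_pred, a)
```

Lower values indicate that the classifier's behaviour depends less on the sensitive attribute; a value of 0 means the corresponding rates match across groups.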
