Training on Plausible Counterfactuals Removes Spurious Correlations

Shpresim Sadiku, Kartikeya Chitranshi, Hiroshi Kera, Sebastian Pokutta

TL;DR

This work investigates plausible counterfactual explanations (p-CFEs) as training data, showing that labeling minimally and plausibly perturbed inputs with the induced incorrect target classes enables classifiers to achieve high accuracy on clean inputs while markedly reducing reliance on spurious correlations. By formulating p-CFE generation with a differentiable plausibility term and solving it via proximal gradient methods, the authors demonstrate that models trained on p-CFEs outperform those trained on standard data and on other perturbation schemes in mitigating background-based biases. The key finding is that p-CFEs not only flip predictions but also steer models toward learning semantic, data-aligned features, leading to improved worst-group performance on spurious-correlation benchmarks such as WaterBirds. The approach is data-efficient and model-agnostic, with potential to scale to higher-dimensional data and larger models, offering practical implications for robust and fair learning without requiring group labels.

Abstract

Plausible counterfactual explanations (p-CFEs) are perturbations that minimally modify inputs to change classifier decisions while remaining plausible under the data distribution. In this study, we demonstrate that classifiers can be trained on p-CFEs labeled with the induced incorrect target classes and still learn to classify unperturbed inputs with their original labels. While previous studies have shown that such learning is possible with adversarial perturbations, we extend this paradigm to p-CFEs. Interestingly, our experiments reveal that learning from p-CFEs is even more effective: the resulting classifiers not only achieve high in-distribution accuracy but also exhibit significantly reduced bias with respect to spurious correlations.
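The data-construction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a fixed binary logistic model, uses an $\ell_1$ penalty (handled by soft-thresholding) as the sparsity term, and uses squared distance to the target-class mean as a stand-in for the differentiable plausibility term. The names `pcfe_delta`, `make_pcfe_dataset`, `lam`, and `lam_plaus` are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pcfe_delta(x, w, b, target, target_mean,
               lam=0.01, lam_plaus=0.1, step=0.1, n_iter=200):
    """Proximal-gradient search for a sparse perturbation delta.

    Illustrative objective (an assumption, not the paper's exact formulation):
        BCE(sigmoid(w.(x + delta) + b), target)
        + (lam_plaus / 2) * ||x + delta - target_mean||^2   # plausibility stand-in
        + lam * ||delta||_1                                 # sparsity
    """
    delta = np.zeros_like(x, dtype=float)
    for _ in range(n_iter):
        z = x + delta
        p = sigmoid(w @ z + b)
        # Gradient of the smooth part (cross-entropy + plausibility) w.r.t. delta.
        grad = (p - target) * w + lam_plaus * (z - target_mean)
        # Gradient step on the smooth part, then the l1 proximal step.
        delta = soft_threshold(delta - step * grad, step * lam)
    return delta

def make_pcfe_dataset(X, y, w, b, class_means, **kw):
    """Perturb each input toward the other class and label it with that target."""
    targets = 1 - y
    X_cf = np.stack([x + pcfe_delta(x, w, b, t, class_means[t], **kw)
                     for x, t in zip(X, targets)])
    return X_cf, targets

# Toy data: the class depends only on the first coordinate.
w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[-2.0, 0.0], [2.0, 0.0]])
y = np.array([0, 1])
class_means = np.array([[-2.0, 0.0], [2.0, 0.0]])
X_cf, y_cf = make_pcfe_dataset(X, y, w, b, class_means)
```

A second classifier would then be trained on `(X_cf, y_cf)` alone and evaluated on the clean `(X, y)`; that training loop is omitted here.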

Paper Structure

This paper contains 20 sections, 2 equations, 21 figures, 2 tables.

Figures (21)

  • Figure 1: Random samples from our WaterBirds training set variants. The bottom row shows perturbations (magnified 40 times for visibility) applied to the original image on the left by different methods. The true label is water bird; the target label is land bird.
  • Figure 2: Top row: Original and Grad-CAM (Selvaraju et al., 2017) visualizations for a misclassified landbird (with a water background) from the WaterBirds dataset, incorrectly predicted as a waterbird. Bottom row: Original and Grad-CAM visualizations for a misclassified big dog (with an indoor background) from the SpuCoAnimals dataset, incorrectly predicted as a small dog.
  • Figure 3: Saliency maps for different models. Left to right: (1) original image of a land bird (top) and a dog (bottom), (2) saliency map from a standard model, (3-5) maps from models trained on PGD ($\ell_2$, $\ell_\infty$) and CFE $\ell_2$ adversarial examples, (6) maps from models trained on p-CFE $\ell_0$ examples. All models use ResNet50.
  • Figure 4: Top row: Original and Grad-CAM visualizations for a misclassified landbird (with a water background) from the WaterBirds dataset—incorrectly predicted as a waterbird. Bottom row: Original and Grad-CAM visualizations for a misclassified small dog (with an outdoor background) from the SpuCoAnimals dataset—incorrectly predicted as a big dog.
  • Figure 5: Comparison of the accuracy on the clean dataset of the model trained on CFE $\ell_2$ perturbations and of the noise-trained model. Data was sampled from a uniform distribution. The hyperparameters $\lambda=0.001$ and $\lambda_{CF} = 0.01$ were used. The algorithm was run for $0.05d$ iterations, where $d$ is the input dimension.
  • ...and 16 more figures

Theorems & Definitions (1)

  • Definition 4.1: Learning from perturbations