Training on Plausible Counterfactuals Removes Spurious Correlations
Shpresim Sadiku, Kartikeya Chitranshi, Hiroshi Kera, Sebastian Pokutta
TL;DR
This work investigates plausible counterfactual explanations (p-CFEs) as training data. It shows that classifiers trained on minimal, plausible perturbations labeled with the induced incorrect target classes achieve high accuracy on clean inputs while relying markedly less on spurious correlations. The authors formulate p-CFE generation with a differentiable plausibility term and solve it with proximal gradient methods. Models trained on p-CFEs mitigate background-based biases better than models trained on standard data or with other perturbation schemes. The key finding is that p-CFEs not only flip predictions but also steer models toward semantic, data-aligned features, which improves worst-group performance on spurious-correlation benchmarks such as WaterBirds. The approach is data-efficient, model-agnostic, and requires no group labels, and it may scale to higher-dimensional data and larger models, with practical implications for robust and fair learning.
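To make the formulation concrete, the following is a minimal sketch of proximal gradient p-CFE generation, not the authors' implementation. It assumes a linear softmax classifier (`W`, `b`), an L1 sparsity penalty handled by soft-thresholding, and the squared distance to a target-class mean as a stand-in for the differentiable plausibility term. All function names and hyperparameters are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def soft_threshold(v, tau):
    # Proximal operator of tau * ||v||_1 (shrinks each entry toward zero).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def generate_pcfe(x, target, W, b, class_mean,
                  lam_sparse=0.01, lam_plaus=0.1, step=0.1, n_iter=200):
    """Proximal gradient sketch for
        min_delta  CE(f(x + delta), target)
                   + lam_plaus * ||x + delta - class_mean||^2   (assumed plausibility term)
                   + lam_sparse * ||delta||_1
    where f is a linear softmax classifier with weights W and bias b.
    """
    delta = np.zeros_like(x)
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(n_iter):
        x_cf = x + delta
        p = softmax(W @ x_cf + b)
        grad_ce = W.T @ (p - onehot)             # cross-entropy gradient w.r.t. the input
        grad_plaus = 2.0 * (x_cf - class_mean)   # gradient of the squared distance
        grad = grad_ce + lam_plaus * grad_plaus
        # Gradient step on the smooth part, then the prox of the L1 term.
        delta = soft_threshold(delta - step * grad, step * lam_sparse)
    return x + delta

# Toy two-class problem: only the first feature matters to the classifier.
W = np.array([[1.0, 0.0], [-1.0, 0.0]])
b = np.zeros(2)
x = np.array([1.0, 0.5])              # classified as class 0
target_mean = np.array([-1.0, 0.5])   # assumed mean of class 1
x_cf = generate_pcfe(x, target=1, W=W, b=b, class_mean=target_mean)
```

In this toy setting the perturbation changes only the feature the classifier uses, which is the role of the sparsity prox; in practice the plausibility term would be replaced by whatever differentiable density or distance model is available.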
Abstract
Plausible counterfactual explanations (p-CFEs) are perturbations that minimally modify inputs to change classifier decisions while remaining plausible under the data distribution. In this study, we demonstrate that classifiers can be trained on p-CFEs labeled with induced incorrect target classes to classify unperturbed inputs with the original labels. While previous studies have shown that such learning is possible with adversarial perturbations, we extend this paradigm to p-CFEs. Interestingly, our experiments reveal that learning from p-CFEs is even more effective: the resulting classifiers not only achieve high in-distribution accuracy but also exhibit significantly reduced bias with respect to spurious correlations.
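The training-set construction described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: `generate_cf` is a hypothetical callable standing in for any p-CFE generator, and targets are sampled uniformly from the classes other than the true label.

```python
import numpy as np

def build_pcfe_dataset(X, y, generate_cf, num_classes, seed=0):
    """Pair each p-CFE with the (incorrect) target class it was pushed toward.

    X           : array of shape (n, d), clean inputs
    y           : array of shape (n,), true labels in {0, ..., num_classes - 1}
    generate_cf : hypothetical callable (x, target) -> perturbed input
    """
    rng = np.random.default_rng(seed)
    X_cf, targets = [], []
    for x, label in zip(X, y):
        # An offset in [1, num_classes - 1] guarantees target != true label.
        offset = rng.integers(1, num_classes)
        target = int((label + offset) % num_classes)
        X_cf.append(generate_cf(x, target))
        targets.append(target)
    return np.stack(X_cf), np.array(targets)

# Usage with a dummy generator that merely shifts the input by the target index.
X = np.zeros((5, 3))
y = np.array([0, 1, 2, 0, 1])
X_cf, targets = build_pcfe_dataset(X, y, lambda x, t: x + t, num_classes=3)
```

A classifier would then be fit on `(X_cf, targets)` alone and evaluated on the clean pairs `(X, y)`, so no training example carries its original label.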
