Identifying Spurious Correlations using Counterfactual Alignment
Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
TL;DR
This work tackles spurious correlations, which undermine generalization in vision models, by proposing counterfactual (CF) alignment. The method applies gradient-guided perturbations in a shared autoencoder latent space to generate counterfactual images for a base classifier $f_b$, e.g., $z_\lambda = z_0 - \lambda \frac{\partial f_b(D(z_0))}{\partial z}$, and then measures how other downstream classifiers respond to those images. A relative change metric, $\text{RelativeChange}(f_1,f_b,z_0)$, quantifies feature sharing and potential biases, enabling both global statistics and instance-level inspection. The authors validate CF alignment on CelebA/CelebA-HQ and Waterbirds, fabricate spurious correlations and detect their presence, and show that robust optimization techniques such as GroupDRO and FLAC reduce reliance on learned spurious features, improving accuracy. CF alignment thus provides a practical, scalable diagnostic tool for detecting, quantifying, and mitigating spurious correlations in black-box classifiers.
Abstract
Models driven by spurious correlations often yield poor generalization performance. We propose the counterfactual (CF) alignment method to detect and quantify spurious correlations of black box classifiers. Our methodology is based on counterfactual images generated with respect to one classifier being input into other classifiers to see if they also induce changes in the outputs of these classifiers. The relationship between these responses can be quantified and used to identify specific instances where a spurious correlation exists. This is validated by observing intuitive trends in face-attribute and waterbird classifiers, as well as by fabricating spurious correlations and detecting their presence, both visually and quantitatively. Furthermore, utilizing the CF alignment method, we demonstrate that we can evaluate robust optimization methods (GroupDRO, JTT, and FLAC) by detecting a reduction in spurious correlations.
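The procedure described above can be illustrated with a minimal sketch. Everything in it is a toy assumption rather than the authors' implementation: the linear `decode` function stands in for the autoencoder decoder $D$, the logistic classifiers stand in for trained attribute classifiers, the gradient is estimated by finite differences instead of backpropagation, and `relative_change` uses one plausible form of the metric (change in $f_1$ divided by change in $f_b$ on the same counterfactual), which may differ from the paper's exact definition.

```python
import math

# Hypothetical toy decoder: maps a 2-d latent vector to 3 "pixels".
DECODER = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

def decode(z):
    """Stand-in for the autoencoder decoder D(z)."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in DECODER]

def make_classifier(weights):
    """Build a toy logistic classifier over the decoded pixels."""
    def f(x):
        s = sum(w * xi for w, xi in zip(weights, x))
        return 1.0 / (1.0 + math.exp(-s))
    return f

def latent_gradient(f, z, eps=1e-5):
    """Central finite-difference estimate of d f(D(z)) / dz."""
    grad = []
    for i in range(len(z)):
        hi, lo = list(z), list(z)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(decode(hi)) - f(decode(lo))) / (2 * eps))
    return grad

def counterfactual_latent(f_b, z0, lam):
    """z_lambda = z_0 - lambda * d f_b(D(z_0)) / dz."""
    g = latent_gradient(f_b, z0)
    return [zi - lam * gi for zi, gi in zip(z0, g)]

def relative_change(f_1, f_b, z0, lam=5.0):
    """Assumed form: change in f_1 relative to change in the base classifier f_b."""
    x0 = decode(z0)
    x_cf = decode(counterfactual_latent(f_b, z0, lam))
    base_delta = f_b(x_cf) - f_b(x0)
    if base_delta == 0.0:
        raise ValueError("counterfactual did not change the base classifier")
    return (f_1(x_cf) - f_1(x0)) / base_delta

f_base = make_classifier([1.0, 0.0, 0.0])         # relies on pixel 0
f_shared = make_classifier([1.0, 0.0, 0.0])       # relies on the same feature
f_independent = make_classifier([0.0, 0.0, 1.0])  # relies on an unrelated pixel
f_opposed = make_classifier([-1.0, 0.0, 0.0])     # responds inversely to pixel 0

z0 = [0.2, -0.1]
rc_shared = relative_change(f_shared, f_base, z0)
rc_independent = relative_change(f_independent, f_base, z0)
rc_opposed = relative_change(f_opposed, f_base, z0)
print(rc_shared, rc_independent, rc_opposed)
```

In this toy setup, a classifier that relies on the same feature as the base classifier changes in lockstep with it (ratio near 1), one that relies on an unrelated feature does not respond (ratio near 0), and one that responds inversely yields a negative ratio. A nonzero ratio for a classifier that should be unrelated to the base attribute is the kind of signal the method uses to flag a possible spurious correlation.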
