Identifying Spurious Correlations using Counterfactual Alignment

Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari

TL;DR

This work tackles spurious correlations that undermine generalization in vision models by proposing counterfactual (CF) alignment, which uses gradient-guided perturbations in a shared autoencoder latent space to generate counterfactuals and assess how downstream classifiers respond. The method quantifies feature-sharing and potential biases with a relative change metric, e.g., $z_\lambda = z_0 - \lambda \frac{\partial f_b(D(z_0))}{\partial z}$ and $\text{RelativeChange}(f_1,f_b,z_0)$, enabling both global statistics and instance-level inspection. The authors validate CF alignment on CelebA/CelebA-HQ and Waterbirds, demonstrate induced spurious correlations, and show that robust optimization techniques like GroupDRO and FLAC reduce the learned spurious feature usage, improving accuracy. CF alignment thus provides a practical, scalable diagnostic tool for detecting, quantifying, and mitigating spurious correlations in black-box classifiers.

Abstract

Models driven by spurious correlations often generalize poorly. We propose the counterfactual (CF) alignment method to detect and quantify spurious correlations in black-box classifiers. Our method generates counterfactual images with respect to one classifier and feeds them into other classifiers to see whether they also change those classifiers' outputs. The relationship between these responses can be quantified and used to identify specific instances where a spurious correlation exists. We validate this by observing intuitive trends in face-attribute and waterbird classifiers, and by fabricating spurious correlations and detecting their presence both visually and quantitatively. Furthermore, using the CF alignment method, we demonstrate that we can evaluate robust optimization methods (GroupDRO, JTT, and FLAC) by detecting a reduction in spurious correlations.
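
The counterfactual step and the relative change measure can be illustrated on a toy problem. The sketch below is not the authors' implementation: the linear `decode` function and the three logistic "classifiers" are hypothetical stand-ins, finite differences replace whatever gradient computation the real pipeline uses, and the ratio form of `relative_change` (change in the downstream output divided by change in the base output) is an assumption, since this page does not spell the formula out. Only the latent update $z_\lambda = z_0 - \lambda \frac{\partial f_b(D(z_0))}{\partial z}$ is taken from the TL;DR.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical stand-ins (not the paper's models): a linear "decoder" D
# and three logistic "classifiers" that each read one decoded feature.
def decode(z):
    return [z[0] + z[1], z[1], z[2]]

def f_base(x):    # base classifier f_b
    return sigmoid(x[0])

def f_shared(x):  # downstream classifier sharing a latent factor with f_b
    return sigmoid(x[1])

def f_indep(x):   # downstream classifier with no shared latent factor
    return sigmoid(x[2])

def latent_gradient(f, z, h=1e-5):
    """Central finite-difference estimate of d f(D(z)) / dz."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        grad.append((f(decode(up)) - f(decode(down))) / (2 * h))
    return grad

def counterfactual_latent(f_b, z0, lam):
    """z_lambda = z_0 - lambda * d f_b(D(z_0)) / dz."""
    grad = latent_gradient(f_b, z0)
    return [zi - lam * gi for zi, gi in zip(z0, grad)]

def relative_change(f_1, f_b, z0, lam=1.0):
    """ASSUMED form: change in f_1 divided by change in f_b."""
    z_lam = counterfactual_latent(f_b, z0, lam)
    x0, x_lam = decode(z0), decode(z_lam)
    base_delta = f_b(x_lam) - f_b(x0)
    if base_delta == 0.0:
        raise ValueError("base classifier output did not change")
    return (f_1(x_lam) - f_1(x0)) / base_delta

z0 = [0.3, -0.2, 0.5]  # arbitrary latent code for illustration
rc_shared = relative_change(f_shared, f_base, z0)
rc_indep = relative_change(f_indep, f_base, z0)
print("shared:", rc_shared, "independent:", rc_indep)
```

In this toy setup the classifier that shares a latent factor with the base classifier moves in the same direction as the base output, while the independent classifier does not move at all, which is the kind of contrast the method is designed to surface.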

Paper Structure

This paper contains 19 sections, 7 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Overview of the alignment methodology. An image is encoded, reconstructed, and then processed by a classifier. The counterfactual is generated by subtracting the gradient of the classifier output w.r.t. the latent representation. The resulting representation is reconstructed back into an image. The reconstructed images are processed with multiple classifiers and the classifier outputs can be plotted side by side to study their alignment. The base model value can be used as the x-axis to more easily compare it to the predictions of another classifier. The output changes can be quantified and compared using relative change.
  • Figure 2: Relationships between face-attribute classifiers as measured by (a) CF-alignment relative changes (left), (b) classifier predictions (middle), and (c) training data labels (right). In (a), base classifiers are along the rows and downstream classifiers are along the columns. Comparing (a) to (b) and (c) shows that many relationships reflected in the CF outputs are preserved from correlations in the training data. We draw the reader's attention to some notable differences. The relationship between male and big_nose, highlighted in red, is strong in both the classifier predictions and the ground truth labels but low in CF alignment, indicating that although these features are correlated, the correlation is not exploited by the classifier. In contrast, the relationship between pointy_nose and smiling, highlighted in green, is weak in both the classifier predictions and the ground truth labels but high in CF alignment, indicating that this relationship was introduced by the classifier.
  • Figure 3: CF alignment examples for $f_b$ = pointy_nose with the most strongly aligned and inversely aligned classifiers. We observe an inverse alignment with big_nose and potential spurious relationships with eyebrows, eyes, hair, and smiling. The relative change is shown next to each classifier name.
  • Figure 4: Example of detecting a spurious correlation in a biased base classifier. The classifier is biased with arched eyebrows and this is observed in the alignment plot as well as in the counterfactual image. The relative change is now 0.97 compared to 0.01 for the unchanged smiling classifier.
  • Figure 5: Counterfactuals generated for models trained on the Waterbirds dataset, together with CF alignment plots. (a) and (b) The relative change between the waterbird classifier and the background classifier is shown in the lower right of the plot. (c) The nostril size is reduced in the counterfactual for the DRO model, indicating that a larger nostril size is associated with waterbirds.
  • ...and 7 more figures