Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them

Ole Delzer, Sidney Bender

Abstract

Deep Neural Networks (DNNs) are increasingly utilized in high-stakes domains like medical diagnostics and autonomous driving where model reliability is critical. However, the research landscape for ensuring this reliability is terminologically fractured across communities that pursue the same goal of ensuring models rely on causally relevant features rather than confounding signals. While frameworks such as distributionally robust optimization (DRO), invariant risk minimization (IRM), shortcut learning, simplicity bias, and the Clever Hans effect all address model failure due to spurious correlations, researchers typically only reference work within their own domains. This reproducibility study unifies these perspectives through a comparative analysis of correction methods under challenging constraints like limited data availability and severe subgroup imbalance. We evaluate recently proposed correction methods based on explainable artificial intelligence (XAI) techniques alongside popular non-XAI baselines using both synthetic and real-world datasets. Findings show that XAI-based methods generally outperform non-XAI approaches, with Counterfactual Knowledge Distillation (CFKD) proving most consistently effective at improving generalization. Our experiments also reveal that the practical application of many methods is hindered by a dependency on group labels, as manual annotation is often infeasible and automated tools like Spectral Relevance Analysis (SpRAy) struggle with complex features and severe imbalance. Furthermore, the scarcity of minority group samples in validation sets renders model selection and hyperparameter tuning unreliable, posing a significant obstacle to the deployment of robust and trustworthy models in safety-critical areas.

Figures (27)

  • Figure 1: Research scope overview. A: Sketch illustrating how a confounder that is spuriously correlated with the class labels due to group imbalance can lead to Clever Hans behavior. Majority groups (green area) generalize well, while unseen minority group data (red area) is systematically misclassified. The few training samples in the minority group can easily be overfitted and hence do not prevent the classifier from learning a Clever Hans strategy. B: Conceptual visualization of the selected correction methods. C: Overview of the datasets used in our experiments, highlighting the distinction between the causal (x-axis) and the confounding (y-axis) feature. D: Identification of valid and Clever Hans decision strategies with the help of LRP and SpRAy; corresponding annotation of whole data point clusters with group labels. E: Application of the correction methods, followed by an evaluation and comparison of the correction quality. This specific example shows the decision boundary and confidence regions of the classifier trained on biased Squares before and after applying RR-ClArC. The ideal decision boundary would be a vertical line in the center, perfectly separating the samples by the value of the causal feature.
  • Figure 2: Selected samples from the CelebA dataset, each paired with its heatmap overlaid on the original image. The classifier was trained to distinguish between images of smiling and non-smiling persons. A watermark was added in the bottom right of each image, with its opacity correlated with the class labels: for images labeled Smiling, the watermark is bold in most cases, whereas for images labeled Not Smiling, it is usually more transparent (a minimal sketch of such a watermark injection follows the figure list). The model therefore learns to exploit the watermark as a Clever Hans feature, which is clearly visible in the heatmap overlays: for the rightmost image in the Not Smiling row, the model correctly considers the mouth of the person to contribute to the Not Smiling class, but the bold watermark contributes negatively, possibly leading to a misclassification. Conversely, for the Smiling class, a bold watermark acts as a strong positive indicator, sometimes leading the model to ignore the shape of the mouth entirely, as in the first and second images of the bottom row.
  • Figure 3: Result of a Spectral Relevance Analysis (SpRAy) on Class '0' samples of the Colored MNIST dataset, where 50% of the samples from Class '0' are colored red, while all other samples are colored white. The spectral embedding is visualized with t-SNE. The samples clearly fall into two distinct clusters, corresponding to the two decision strategies learned by the model: classification by color (left side), which constitutes a Clever Hans effect, and classification by the shape of the depicted digit (right side), which is the correct, causal concept the model should base its decision on (a schematic sketch of this clustering pipeline follows the figure list).
  • Figure 4: Sketch illustrating how A-ClArC and P-ClArC use a CAV to (a) add the confounder to or (b) remove the confounder from all samples. During training, no examples from class B that contain the confounder are available, so the model strongly associates the confounder with class A, resulting in a Clever Hans decision boundary. At inference time, new data points from class B that do contain the confounder are consequently misclassified. To fix this Clever Hans behavior, both A-ClArC and P-ClArC project all data points onto a hyperplane orthogonal to the direction associated with the confounder (captured by the CAV), so that the model can no longer use it as a discriminative feature between the two classes (a minimal sketch of this projection step follows the figure list). In this setting, P-ClArC has the advantage that no additional fine-tuning is necessary, because for data points on the hyperplane, the true decision boundary and the confounded decision boundary already yield the same classification. A-ClArC requires fine-tuning, because otherwise all projected data points from class B would be misclassified, since the true decision boundary and the Clever Hans decision boundary diverge for class B samples containing the confounder.
  • Figure 5: Counterfactual Explanations generated for source class Not Smiling and target class Smiling. From left to right, each panel shows the original image, its counterfactual, and the per-pixel difference between the original and the counterfactual. (a) shows a true counterfactual explanation, i.e. the Counterfactual Explainer changed the causal feature that distinguishes the source and target class by altering the mouth area in such a way that the depicted person is now actually smiling. (b) shows a false counterfactual, as the causal feature remains unchanged and instead the opacity of the watermark is increased, hinting at the model possibly being affected by Clever Hans.
  • ...and 22 more figures
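To make the class-correlated watermark setup of Figure 2 concrete, the following is a minimal NumPy sketch of how such a confounder could be injected into an image dataset. It is not the paper's data-generation code; the watermark region, opacity values, and correlation strength (p_bold) are illustrative assumptions only.

```python
import numpy as np

def add_correlated_watermark(image, label, p_bold=0.8, rng=None):
    """Blend a rectangular watermark into the bottom-right corner of `image`
    (H x W x 3 float array in [0, 1]). The watermark opacity is correlated
    with the binary label: label 1 ('Smiling') mostly receives a bold
    watermark, label 0 ('Not Smiling') mostly a faint one, which creates
    the spurious cue described in Figure 2."""
    rng = rng or np.random.default_rng()
    bold = rng.random() < (p_bold if label == 1 else 1.0 - p_bold)
    alpha = 0.9 if bold else 0.2                # bold vs. nearly transparent
    out = image.copy()
    h, w = out.shape[:2]
    region = out[h - 16:h - 4, w - 40:w - 4]    # hypothetical watermark area
    region[:] = (1.0 - alpha) * region + alpha  # blend towards white
    return out
```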
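The clustering shown in Figure 3 can be reproduced in spirit with a short scikit-learn pipeline: embed the per-sample relevance heatmaps spectrally, cluster the embedding, and project it to 2-D for inspection. This is a schematic sketch under the assumption that LRP heatmaps are already available as arrays; the reference SpRAy procedure additionally analyses the eigenvalue spectrum (eigengap) to choose the number of clusters.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding, TSNE
from sklearn.cluster import KMeans

def spray_like_analysis(heatmaps, n_clusters=2, n_components=8):
    """Group relevance heatmaps of one class into candidate decision
    strategies, in the spirit of SpRAy.
    heatmaps: array of shape (n_samples, H, W), e.g. LRP relevance maps."""
    flat = heatmaps.reshape(len(heatmaps), -1)
    # Spectral embedding of the heatmap similarity structure.
    embedded = SpectralEmbedding(
        n_components=n_components, affinity="nearest_neighbors"
    ).fit_transform(flat)
    # Cluster the embedded heatmaps into candidate decision strategies.
    strategy = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedded)
    # 2-D t-SNE projection for visual inspection, as in Figure 3.
    coords = TSNE(n_components=2, perplexity=30).fit_transform(embedded)
    return strategy, coords
```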
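The projection step sketched in Figure 4 can be written down compactly: given a unit-length CAV v pointing in the confounder direction, P-ClArC-style suppression removes the component of every activation along v, i.e. x' = x - (x · v) v. The snippet below is a minimal sketch of that idea, using a difference-of-means CAV as a stand-in for a properly trained linear concept classifier; it is not the authors' implementation, and A-ClArC would instead add a multiple of v to the activations and then fine-tune the model.

```python
import numpy as np

def fit_cav(acts_with_confounder, acts_without_confounder):
    """Estimate a concept activation vector (CAV) for the confounder.
    Here: normalized difference of group-wise activation means, used as a
    simple stand-in for a trained linear classifier's weight vector."""
    v = acts_with_confounder.mean(axis=0) - acts_without_confounder.mean(axis=0)
    return v / np.linalg.norm(v)

def project_out(acts, cav):
    """Project activations onto the hyperplane orthogonal to the CAV,
    removing the confounder direction (P-ClArC-style suppression)."""
    coeffs = acts @ cav                   # signed component along the CAV
    return acts - np.outer(coeffs, cav)   # subtract that component

# Toy usage: 2-D activations where the second dimension encodes the confounder.
rng = np.random.default_rng(0)
with_conf = rng.normal(loc=[0.0, 3.0], scale=0.5, size=(100, 2))
without_conf = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
cav = fit_cav(with_conf, without_conf)
cleaned = project_out(with_conf, cav)     # confounder component ~ removed
```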