Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation

Fahimeh Hosseini Noohdani, Parsa Hosseini, Aryan Yazdan Parast, Hamidreza Yaghoubi Araghi, Mahdieh Soleymani Baghshah

TL;DR

Decompose-and-Compose (DaC) tackles spurious correlations under distribution shifts caused by image compositionality. It uses attribution-guided adaptive masking to identify causal regions and then creates counterfactual, mixed-image samples to balance underrepresented groups. By retraining only the last layer on a dataset augmented with these combined images, DaC achieves strong worst-group performance without requiring group labels, outperforming several label-aware baselines on multiple benchmarks. The approach is interpretable, data-efficient, and broadly applicable to diverse spurious cues, with ablations showing that the combining intensity and the proportion of selected data both critically affect robustness. Overall, DaC advances practical out-of-distribution generalization in vision by leveraging causality-informed decomposition and compositional data augmentation.

Abstract

While standard Empirical Risk Minimization (ERM) training has proven effective for image classification on in-distribution data, it fails to perform well on out-of-distribution samples. One of the main sources of distribution shift in image classification is the compositional nature of images. Specifically, in addition to the main object or component(s) determining the label, other image components usually exist, which may shift the input distribution between train and test environments. More importantly, these components may have spurious correlations with the label. To address this issue, we propose Decompose-and-Compose (DaC), which improves robustness to correlation shift through a compositional approach based on combining elements of images. Based on our observations, models trained with ERM usually attend strongly to either the causal components or the components having a high spurious correlation with the label (especially on datapoints for which the model has high confidence). In fact, depending on the amount of spurious correlation and the ease of classification based on the causal or non-causal components, the model usually attends more to one of these (on samples with high confidence). Following this, we first try to identify the causal components of images using class activation maps of models trained with ERM. Afterward, we intervene on images by combining them and retrain the model on the augmented data, including the counterfactual ones. Along with its high interpretability, this work proposes a group-balancing method that intervenes on images without requiring group labels or information regarding the spurious features during training. The method has overall better worst-group accuracy than previous methods with the same amount of group-label supervision under correlation shift.

Paper Structure

This paper contains 31 sections, 4 equations, 14 figures, 6 tables, 2 algorithms.

Figures (14)

  • Figure 1: Behaviour of a model trained with standard ERM on different datasets. Depending on how easily the label can be inferred from the causal or non-causal parts across the whole dataset, the model attends more to one of them; this behaviour is more evident in samples on which the model has a low loss. (a), (b) Average xGradCAM score of CIFAR-10 (causal) and MNIST (non-causal) pixels in four loss quantiles of the Dominoes training set. The model generally attends more to the non-causal parts, and as the loss decreases, the non-causal attention increases. (c), (d) Average xGradCAM score of foreground (causal) and background (non-causal) pixels in four loss quantiles of the Waterbirds training set. The model generally attends to the causal parts, and as the loss decreases, the causal attention increases.
  • Figure 2: (a) Image as a composition of causal and non-causal components. (b) The edge between $S$ and $\tilde{S}$ can be removed by intervention on components in $\tilde{S}$. This removes the spurious correlation between $\tilde{S}$ and $Y$.
  • Figure 3: Adaptive masking according to the attention scores obtained from the ERM model. $l_p$ denotes the loss on the masked image in which a portion $p$ of the pixels (those with the lowest attention scores) has been masked. (a) The loss curve for an image of the Dominoes dataset with the label 'truck' on which the ERM model has non-causal attention, and (b) the loss curve for an image of the MetaShift dataset with the label 'dog' on which the ERM model has causal attention.
  • Figure 4: (a) An overview of our DaC method. For each batch, a $q$ portion of samples with the lowest loss is selected. Then images of different labels are combined by the Mask and Combine module. The overall loss used to update the model's last-layer parameters is a weighted sum of the loss on the original batch ($L_{CE}$) and the loss on the combined data ($L_{\text{comb}}$). The full algorithm for this method is given in the paper. (b) The Mask and Combine module. The two input images $x^{(i)}$ and $x^{(j)}$ are masked by the adaptive masking algorithm. Afterwards, the selected part of $x^{(i)}$ and the masked parts of $x^{(j)}$ are combined, and the remaining gaps are filled with the mean value of the batch. The new combined image has the same label as $x^{(i)}$ and is used for training the last layer of the model.
  • Figure 5: Worst group accuracy on different datasets with respect to $\alpha$. $\alpha\geq 1$ is enough to increase worst group accuracy rapidly.
  • ...and 9 more figures
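
The pipeline described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made for illustration: all function names are hypothetical, the masked portion `p` is passed in as a fixed value instead of being chosen by the paper's adaptive masking algorithm, pixels selected from $x^{(i)}$ take priority where regions overlap, and the weighted sum is assumed to have the form $L_{CE} + \alpha L_{\text{comb}}$.

```python
import numpy as np


def select_low_loss(losses, q):
    """Indices of the q-portion of samples with the lowest loss."""
    k = max(1, int(q * len(losses)))
    return np.argsort(losses)[:k]


def keep_mask(attention, p):
    """Boolean (H, W) mask; False marks the p-portion of pixels
    with the lowest attention score (p is fixed here, an assumption)."""
    flat = np.asarray(attention, dtype=float).ravel()
    n_masked = int(p * flat.size)
    keep = np.ones(flat.size, dtype=bool)
    keep[np.argsort(flat, kind="stable")[:n_masked]] = False
    return keep.reshape(np.shape(attention))


def mask_and_combine(x_i, x_j, keep_i, keep_j, fill_value):
    """Combine the selected part of x_i with the masked-out part of x_j.

    Pixels covered by neither region are filled with fill_value
    (the caption uses the mean value of the batch). Giving x_i
    priority on overlap is an assumption of this sketch.
    """
    out = np.full_like(x_i, fill_value, dtype=float)
    from_j = ~keep_i & ~keep_j          # masked in x_j, not claimed by x_i
    out[keep_i] = x_i[keep_i]
    out[from_j] = x_j[from_j]
    return out                          # carries the label of x_i


def total_loss(loss_ce, loss_comb, alpha):
    """Assumed form of the weighted sum: L_CE + alpha * L_comb."""
    return loss_ce + alpha * loss_comb
```

In a training loop, `select_low_loss` would pick the low-loss samples of a batch, `keep_mask` would be applied to each sample's attention map, and pairs with different labels would be passed to `mask_and_combine` with `fill_value` set to the batch mean.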