Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations

Olawale Salaudeen, Haoran Zhang, Kumail Alhamoud, Sara Beery, Marzyeh Ghassemi

TL;DR

The paper argues that the apparent stability of accuracy-on-the-line (AoTL) in OOD generalization is misleading: aggregation hides accuracy-on-the-inverse-line (AoTIL) subsets, where higher in-distribution accuracy predicts lower OOD accuracy. It introduces OODSelect, a gradient-based method for identifying large, semantically coherent OOD subsets that decouple ID and OOD performance and reveal spurious correlations; these subsets often make up a sizable fraction of the OOD data. Across multiple benchmarks (DomainBed, WILDS, TerraIncognita, Chest X-ray datasets), AoTIL subsets emerge with negative or weak ID-OOD correlations, in contrast to the strong positive AoTL correlation observed on the full OOD sets. The authors release code and the identified subsets and call for subset-aware benchmarking to better diagnose and improve OOD robustness against spurious correlations.

Abstract

Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, termed "accuracy-on-the-line." This pattern is often taken to imply that spurious correlations (correlations that improve ID but reduce OOD performance) are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy-on-the-line does not hold. Across widely used distribution shift benchmarks, OODSelect uncovers subsets, sometimes over half of the standard OOD set, where higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified subsets to facilitate further research.
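
The aggregation effect described in the abstract can be illustrated with a small synthetic sketch. This is not OODSelect and uses no data from the paper: the "spurious" subset is known by construction, and the model count, example count, subset fraction, and accuracy slopes are all made-up assumptions. It only shows how the ID-OOD Pearson correlation across models can be strongly positive on the full OOD set while being strongly negative on a sizable subset. Raw accuracies are correlated directly, without any transform.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
num_models, num_examples = 40, 2000  # hypothetical sizes

# Assumption: the first 30% of OOD examples depend on a spurious feature.
spurious = [i < 600 for i in range(num_examples)]

# Hypothetical ID accuracies, evenly spread over [0.6, 0.9].
id_acc = [0.6 + 0.3 * m / (num_models - 1) for m in range(num_models)]


def simulate_correct(a):
    """Per-example correctness for one model with ID accuracy `a`."""
    row = []
    for is_spurious in spurious:
        # Made-up rule: better ID models do worse on the spurious group
        # and better on everything else.
        p = 1.2 - a if is_spurious else 0.2 + 0.8 * a
        row.append(rng.random() < p)
    return row


correct = [simulate_correct(a) for a in id_acc]


def subset_accuracy(row, mask):
    """Accuracy of one model restricted to the examples where mask is True."""
    picked = [c for c, keep in zip(row, mask) if keep]
    return sum(picked) / len(picked)


full_mask = [True] * num_examples
ood_full = [subset_accuracy(r, full_mask) for r in correct]
ood_sub = [subset_accuracy(r, spurious) for r in correct]

r_full = pearson(id_acc, ood_full)
r_sub = pearson(id_acc, ood_sub)
print(f"full OOD set: r = {r_full:+.2f}")
print(f"spurious subset: r = {r_sub:+.2f}")
```

In this toy setup the majority group dominates the aggregate accuracy, so the full-set correlation stays positive even though 30% of the examples follow the opposite trend.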

Paper Structure

This paper contains 23 sections, 4 theorems, 41 equations, 12 figures, 3 tables, 2 algorithms.

Key Result

Lemma 1

Let $(\mathbf{z}_i,\mathbf{w}_i)_{i=1}^{N}\subseteq[\alpha,1-\alpha]^2$ with $\alpha\in(0,1)$. Define and let be the sample Pearson correlation of the first $N$ transformed pairs. Add one more pair $(\mathbf{z}_{N+1},\mathbf{w}_{N+1})$ with $\mathbf{z}_{N+1}\in[\alpha,1-\alpha]$ and $\mathbf{w}_{N+1}=\beta\,\mathbf{z}_{N+1}$, and denote the updated correlation by $\rho_{N+1}$. Then where the co
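
The setting of the lemma, adding one pair with $\mathbf{w}_{N+1}=\beta\,\mathbf{z}_{N+1}$ to $N$ existing pairs, can be explored numerically. The sketch below is only an empirical check on synthetic data: it does not assume any particular bound or transform, it correlates the raw pairs, and the values of `alpha`, `beta`, `n`, and the data-generating rule are hypothetical choices made for illustration.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
alpha, beta, n = 0.1, 0.8, 50  # hypothetical values


def clip(v):
    """Keep a value inside [alpha, 1 - alpha]."""
    return min(max(v, alpha), 1 - alpha)


# Synthetic pairs in [alpha, 1 - alpha]^2 with a positive linear trend.
z = [rng.uniform(alpha, 1 - alpha) for _ in range(n)]
w = [clip(0.3 + 0.5 * zi + rng.gauss(0, 0.05)) for zi in z]
rho_n = pearson(z, w)

# Add one more pair with w_new = beta * z_new, as in the lemma's setup.
z_new = 0.5
w_new = beta * z_new
rho_next = pearson(z + [z_new], w + [w_new])

shift = abs(rho_next - rho_n)
print(f"rho_N = {rho_n:.4f}, rho_N+1 = {rho_next:.4f}, shift = {shift:.4f}")
```

Repeating the experiment with larger `n` is one way to see how the influence of a single added pair shrinks as the number of existing pairs grows.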

Figures (12)

  • Figure 1: Aggregation Masking AoTIL. Consider Terra Incognita, where ID data are drawn from camera traps at locations L100, L38, L43, and OOD data from L46 (beery2018recognition). Aggregation masks the effect of spurious correlations on generalization, such as daylight, even though a substantial number of OOD samples are still systematically misclassified. Note that OODSelect examples differ from the most misclassified examples, which always have an ID-OOD accuracy correlation of near zero. Confidence intervals correspond to 95% Fisher z-intervals.
  • Figure 2: Comparing AoTL and AoTIL. Pearson correlation between ID and OOD accuracy as a function of the number of selected OOD samples. Correlation values above 0.3 indicate AoTL and values below -0.3 indicate AoTIL; correlations in between are considered weak. We compare a Random Selection of data samples and the Most Misclassified at fixed size intervals from 100 to over 100,000 (normalized to sample size in the figure). Random selections yield strong positive correlation, while misclassified samples have weak correlations; that is, our method does not conflate spurious correlations with general difficulty (e.g., label noise). OODSelect identifies subsets where ID and OOD accuracy are negatively correlated, in one case (CXR) for over 70% of the usual OOD dataset. This behavior is dataset-dependent due to differences in distributional properties. Table \ref{tab:corrs} enumerates detailed correlations.
  • Figure 3: CXR No Finding. Figure \ref{fig:cxr_all_sub} suggests that poor generalization may arise for a large subset of the OOD population from reliance on spurious correlations. However, aggregation hides this failure mode since the correlation for the full OOD set is strongly positive. This selected subset also has a prevalence shift from the full dataset; statistical significance for the prevalence shift was assessed using bootstrapping with 1000 resamples.
  • Figure 4: The correlation directions are not driven by outliers: Spearman rank correlation is robust to outliers while Pearson R is not, yet the trends are similar (full results in Figures \ref{fig:corr_summary} and \ref{fig:spearman_corr_summary}).
  • Figure 5: Independent Architecture Families. Our findings hold even when train/test models are from disjoint architecture families, e.g., ResNets vs. ViTs.
  • ...and 7 more figures
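
The statistics named in the captions above (95% Fisher z-intervals, the 0.3 and -0.3 thresholds, and Spearman versus Pearson correlation) can be sketched in a few lines. This is a minimal illustration with hypothetical helper names (`fisher_ci`, `classify`, `spearman`) and toy data, not the authors' evaluation code; the interval uses the standard Fisher transform with standard error 1/sqrt(n - 3).

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% interval for a correlation r from n points (n > 3)."""
    center = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(center - half_width), math.tanh(center + half_width)


def classify(r, threshold=0.3):
    """Label a correlation using the thresholds from the Figure 2 caption."""
    if r > threshold:
        return "AoTL"
    if r < -threshold:
        return "AoTIL"
    return "weak"


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: monotone relationship with one extreme value in y.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 1000]
r_p, r_s = pearson(x, y), spearman(x, y)
low, high = fisher_ci(0.0, 103)
print(f"Pearson = {r_p:.3f}, Spearman = {r_s:.3f}")
print(f"CI for r = 0 with n = 103: ({low:.3f}, {high:.3f})")
print(classify(0.5), classify(-0.5), classify(0.1))
```

On the toy data the single extreme value pulls Pearson R well below 1 while the rank correlation is unaffected, which is the sense in which Spearman is the more outlier-robust check.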

Theorems & Definitions (9)

  • Definition 1: Correlation Property (miller2021accuracy)
  • Lemma 1: Bounded Effect of a New Model on Pearson $R$
  • proof
  • Lemma 2: Bounded Effect of a New Example on Pearson $R$
  • proof
  • Lemma 3: Lipschitz Continuity of Pearson Correlation w.r.t. $\mathbf{s}$
  • proof
  • Proposition 1: Non-Submodularity; no diminishing returns (Informal)
  • proof