Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations
Olawale Salaudeen, Haoran Zhang, Kumail Alhamoud, Sara Beery, Marzyeh Ghassemi
TL;DR
The paper argues that accuracy-on-the-line (AoTL), the strong positive correlation between in-distribution (ID) and out-of-distribution (OOD) accuracy across models, can be misleading: aggregating over the full OOD set hides accuracy-on-the-inverse-line (AoTIL) subsets, where higher ID accuracy predicts lower OOD accuracy. It introduces OODSelect, a gradient-based method for identifying large, semantically coherent OOD subsets on which ID and OOD performance decouple, revealing spurious correlations. These subsets often make up a sizable fraction of the OOD data. Across multiple benchmarks (DomainBed, WILDS, TerraIncognita, Chest X-ray datasets), AoTIL subsets emerge with negative or weak ID/OOD correlations, in contrast to the strong positive AoTL observed on the full OOD sets. The authors release code and the identified subsets, and call for subset-aware benchmarking to better diagnose and improve OOD robustness against spurious correlations.
Abstract
Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, termed "accuracy-on-the-line." This pattern is often taken to imply that spurious correlations (correlations that improve ID but reduce OOD performance) are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy-on-the-line does not hold. Across widely used distribution shift benchmarks, OODSelect uncovers subsets, sometimes over half of the standard OOD set, where higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified subsets to facilitate further research.
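The aggregation effect described in the abstract can be illustrated with a small synthetic sketch. This is not the authors' OODSelect implementation: instead of gradient-based selection, it uses a simple covariance ranking as a stand-in, and all data, group sizes, probabilities, and function names (`pearson`, `select_inverse_subset`) are assumptions made up for illustration.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def select_inverse_subset(id_acc, correct, k):
    """Return indices of the k OOD examples whose per-model correctness
    covaries most negatively with ID accuracy.

    A covariance ranking used as a simple stand-in; it is NOT the
    gradient-based OODSelect procedure from the paper.
    """
    centered_id = id_acc - id_acc.mean()
    centered_correct = correct - correct.mean(axis=0)
    cov = centered_id @ centered_correct / len(id_acc)  # shape: (n_examples,)
    return np.argsort(cov)[:k]


# Synthetic setup (all numbers are illustrative assumptions).
rng = np.random.default_rng(0)
n_models, n_a, n_b = 40, 700, 300
id_acc = np.linspace(0.60, 0.95, n_models)
t = (id_acc - id_acc.min()) / (id_acc.max() - id_acc.min())

# Group A: OOD accuracy rises with ID accuracy.
p_a = 0.2 + 0.8 * t
# Group B: OOD accuracy falls with ID accuracy, as it would if the
# models leaned on a spurious feature that flips under the shift.
p_b = 0.9 - 0.6 * t

# correct[m, i] = 1.0 if model m classifies OOD example i correctly.
correct = np.hstack([
    rng.random((n_models, n_a)) < p_a[:, None],
    rng.random((n_models, n_b)) < p_b[:, None],
]).astype(float)

# Aggregate view: correlation of ID accuracy with full-set OOD accuracy.
full_corr = pearson(id_acc, correct.mean(axis=1))

# Subset view: the same correlation restricted to the selected examples.
subset = select_inverse_subset(id_acc, correct, k=200)
subset_corr = pearson(id_acc, correct[:, subset].mean(axis=1))

print(f"full OOD set correlation: {full_corr:+.2f}")
print(f"selected subset correlation: {subset_corr:+.2f}")
```

In this toy setting the full-set correlation is positive while the selected subset's correlation is negative, even though the subset is a fifth of the OOD examples. Because the subset here is chosen and scored on the same models, the subset correlation is optimistically biased; a more careful check would score it on models held out from selection.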
