Nuisances via Negativa: Adjusting for Spurious Correlations via Data Augmentation

Aahlad Puli, Nitish Joshi, Yoav Wald, He He, Rajesh Ranganath

TL;DR

This work tackles the problem of spurious correlations arising from nuisances whose label relationships shift across domains. It introduces semantic corruptions as a principled data augmentation approach to reveal nuisance-label correlations and to train models that rely less on nuisances, enabling robustness to distribution shifts. The authors formalize conditions under which semantic corruptions enable risk-invariant learning and demonstrate across Waterbirds, NLI, and cardiomegaly detection that corruption-powered methods can outperform vanilla baselines and approach nuisance-informed baselines. They also connect semantic corruptions to existing debiasing frameworks like PoE/DFL and JTT, showing competitive improvements in worst-group performance. The results suggest semantic corruptions offer a practical, annotation-light path to broaden the set of tasks for which robust generalization is achievable, with validation guiding corruption strength to preserve useful nuisances while destroying semantic cues.
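
As a rough illustration of how a corruption-trained model could plug into a JTT-style method: JTT upweights training examples that a first-stage model misclassifies. The sketch below assumes the first-stage ("biased") model was trained on semantically corrupted inputs, so its errors flag examples where nuisances alone mispredict the label. The function name `jtt_style_weights` and the `upweight` parameter are hypothetical, chosen for this example, and are not taken from the paper's code.

```python
import numpy as np


def jtt_style_weights(biased_preds, labels, upweight=5.0):
    """Return per-example weights for a second-stage training run.

    biased_preds: predictions of a first-stage model (assumed here to be
                  trained on corrupted inputs; this is an illustrative setup).
    labels:       ground-truth labels with the same shape.
    upweight:     hypothetical factor applied to misclassified examples.
    """
    biased_preds = np.asarray(biased_preds)
    labels = np.asarray(labels)
    if biased_preds.shape != labels.shape:
        raise ValueError("biased_preds and labels must have the same shape")

    # Every example starts with weight 1.
    weights = np.ones(labels.shape, dtype=float)
    # Examples the nuisance-reliant model gets wrong are upweighted.
    weights[biased_preds != labels] = upweight
    return weights
```

The resulting weights would multiply the per-example loss in the second training run; the upweighting factor is a hyperparameter that would normally be chosen on validation data.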

Abstract

In prediction tasks, there exist features that are related to the label in the same way across different settings for that task; these are semantic features or semantics. Features with varying relationships to the label are nuisances. For example, in detecting cows from natural images, the shape of the head is semantic, but because images of cows often, though not always, have grass backgrounds, the background is a nuisance. Models that exploit nuisance-label relationships face performance degradation when these relationships change. Building models robust to such changes requires additional knowledge beyond samples of the features and labels. For example, existing work uses annotations of nuisances or assumes ERM-trained models depend on nuisances. Approaches that integrate new kinds of additional knowledge enlarge the settings where robust models can be built. We develop an approach that uses knowledge about the semantics by corrupting them in data, and then uses the corrupted data to produce models that identify correlations between nuisances and the label. Once these correlations are identified, they can be used to adjust for where nuisances drive predictions. We study semantic corruptions in powering different spurious-correlation-avoiding methods on multiple out-of-distribution (OOD) tasks like classifying waterbirds, natural language inference (NLI), and detecting cardiomegaly in chest X-rays.
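
One way to picture a semantic corruption for images is to shuffle square patches, which scrambles global shape (such as the outline of a head) while leaving local texture and color largely intact. The sketch below is illustrative only: the function name `patch_randomize`, its parameters, and the choice of patch shuffling as the corruption are assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def patch_randomize(image, patch_size, rng=None):
    """Shuffle non-overlapping square patches of an (H, W) or (H, W, C) array.

    Hypothetical helper for illustration; requires H and W to be
    divisible by patch_size.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    extra = image.shape[2:]  # channel dimension, if any

    # Cut the image into a stack of (rows * cols) patches.
    patches = (
        image.reshape(rows, patch_size, cols, patch_size, *extra)
        .swapaxes(1, 2)
        .reshape(rows * cols, patch_size, patch_size, *extra)
    )

    # Permute patch positions; pixel content inside each patch is unchanged.
    patches = patches[rng.permutation(rows * cols)]

    # Reassemble the shuffled patches into an image of the original shape.
    return (
        patches.reshape(rows, cols, patch_size, patch_size, *extra)
        .swapaxes(1, 2)
        .reshape(image.shape)
    )
```

Under this sketch, smaller patches destroy more global structure. A model trained to predict the label from such corrupted inputs can then only pick up on what survives the corruption, which is how the corrupted data is meant to expose nuisance-label correlations.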

Paper Structure

This paper contains 53 sections, 4 theorems, 24 equations, 3 figures, 16 tables, 1 algorithm.

Key Result

Theorem 1

For any learning algorithm, there exists a nuisance-varying family $\mathcal{F}$ where predicting with $p_{\perp\!\!\!\perp}(\mathbf{y}=1 \mid \mathbf{x})$ achieves $90\%$ accuracy on all members such that given training data …

Figures (3)

  • Figure 1: Semantic corruptions of Waterbirds via […] and chest X-rays via rm.
  • Figure 2: Semantic corruptions of chest X-rays via […] and […] respectively.
  • Figure 3: Example of […] of a chest X-ray image. The image is followed by […] of size $112, 56, 28, 14, 7, 2$.

Theorems & Definitions (9)

  • Definition 1
  • Theorem 1
  • Definition 2: Semantic Corruption
  • Proposition 1
  • Theorem 1
  • proof
  • Definition 3: Semantic Corruption
  • Proposition 1
  • proof