Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation

Amartya Sanyal, Yaxi Hu, Yaodong Yu, Yian Ma, Yixin Wang, Bernhard Schölkopf

TL;DR

The paper questions the robustness of the Accuracy-on-the-line phenomenon (the positive ID–OOD accuracy correlation) under realistic data issues like label noise and nuisance/spurious features. It combines a theoretical linear-model analysis with empirical demonstrations on Colored MNIST and fMoW to derive sufficient conditions under which Accuracy-on-the-wrong-line emerges and shows how scaling can worsen the effect. The key contributions include a formal data-model framework with disjoint signal/nuisance subspaces, a three-condition criterion for when the phenomenon breaks, and an experimental ablation validating the theory. The work highlights practical risks of relying on large noisy datasets for generalization and motivates approaches to mitigate memorization of noise and spurious correlations.

Abstract

"Accuracy-on-the-line" is a widely observed phenomenon in machine learning, where a model's accuracy on in-distribution (ID) and out-of-distribution (OOD) data is positively correlated across different hyperparameters and data configurations. But when does this useful relationship break down? In this work, we explore its robustness. The key observation is that noisy data and the presence of nuisance features can be sufficient to shatter the Accuracy-on-the-line phenomenon. In these cases, ID and OOD accuracy can become negatively correlated, leading to "Accuracy-on-the-wrong-line". This phenomenon can also occur in the presence of spurious (shortcut) features, which tend to overshadow the more complex signal (core, non-spurious) features, resulting in a large nuisance feature space. Moreover, scaling to larger datasets does not mitigate this undesirable behavior and may even exacerbate it. We formally prove a lower bound on Out-of-distribution (OOD) error in a linear classification model, characterizing the conditions on the noise and nuisance features for a large OOD error. We finally demonstrate this phenomenon across both synthetic and real datasets with noisy data and nuisance features.
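The distinction drawn in the abstract comes down to the sign of the ID–OOD accuracy correlation across a family of models. The following is a minimal sketch of that check, not the authors' evaluation code: the function names `pearson` and `classify_trend` are hypothetical, and the accuracy values are made-up illustrative numbers, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def classify_trend(id_acc, ood_acc):
    """Label a set of (ID, OOD) accuracy pairs by the sign of their correlation."""
    r = pearson(id_acc, ood_acc)
    if r > 0:
        label = "accuracy-on-the-line"
    elif r < 0:
        label = "accuracy-on-the-wrong-line"
    else:
        label = "no linear trend"
    return label, r


# Illustrative numbers only (assumed, NOT taken from the paper):
# one entry per model, e.g. per hyperparameter or dataset-size setting.
id_acc = [0.70, 0.80, 0.90, 0.95]
ood_acc_noiseless = [0.55, 0.65, 0.72, 0.80]  # OOD accuracy rises with ID accuracy
ood_acc_noisy = [0.60, 0.52, 0.45, 0.40]      # OOD accuracy falls as ID accuracy rises

if __name__ == "__main__":
    print(classify_trend(id_acc, ood_acc_noiseless))
    print(classify_trend(id_acc, ood_acc_noisy))
```

A plain Pearson correlation on raw accuracies is used here for simplicity; other choices, such as rescaling the accuracy axes before fitting a line, are possible and would not change the sign-based reading in this sketch.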

Paper Structure

This paper contains 17 sections, 10 theorems, 38 equations, 10 figures, and 1 table.

Key Result

Theorem 1

For any $S_d, S_k$, let $\mathcal{D}$ be a $(S_d, S_k)$-disjoint signal distribution, and let $\Delta$ be a $S_d$-oblivious shift distribution where each coordinate is an independent subgaussian with parameter $\sigma$. Then, for any $x \in \mathrm{dom}(\mu)$ and $\widehat{w} \in \mathbb{R}^{d+\ldots}$ … for all $x \in \mathrm{dom}(\mu)$ where $\langle \widehat{w}, x \rangle \geq$ …

Figures (10)

  • Figure 1: Accuracy-on-the-wrong-line behaviour on the noisy dataset vs. Accuracy-on-the-line behaviour on the noiseless dataset in the linear setting. See the section on the linear setting for a description of the setup.
  • Figure 2: Same setting as Figure 1. Increasing dataset size always increases ID accuracy irrespective of label noise, but decreases OOD accuracy in the presence of label noise.
  • Figure 3: Each plot shows ID vs. OOD accuracy for varying label noise rates $\eta$ on the colored MNIST dataset. Similar to Figure 1, the Accuracy-on-the-line phenomenon degrades with increasing amounts of label noise.
  • Figure 4: Experiments on the domain-correlated dataset with label noise. The noisy dataset (left) shows the Accuracy-on-the-wrong-line phenomenon, while the noiseless dataset (center) shows the Accuracy-on-the-line phenomenon. When the noisy dataset is not interpolated, e.g. due to early stopping (right), the Accuracy-on-the-line phenomenon persists.
  • Figure 5: The nuisance-sensitivity panel shows that, as the amount of label noise increases, nuisance sensitivity and nuisance density increase faster with larger dataset sizes. This leads to worse OOD accuracy, as shown in the OOD-accuracy panel. However, ID accuracy still increases with dataset size, as shown in the ID-accuracy panel.
  • ...and 5 more figures

Theorems & Definitions (17)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Corollary 2
  • Proposition 3: Informal
  • Theorem 3
  • proof
  • Corollary 3
  • proof
  • Theorem 4
  • ...and 7 more