Label Noise: Ignorance Is Bliss

Yilun Zhu, Jianxin Zhang, Aditya Gangrade, Clayton Scott

TL;DR

The paper introduces relative signal strength (RSS), a pointwise measure that quantifies the transferability from noisy to clean posterior, and uses it to support the simple NI-ERM principle, which minimizes empirical risk while ignoring label noise.

Abstract

We establish a new theoretical framework for learning under multi-class, instance-dependent label noise. This framework casts learning with label noise as a form of domain adaptation, in particular, domain adaptation under posterior drift. We introduce the concept of relative signal strength (RSS), a pointwise measure that quantifies the transferability from noisy to clean posterior. Using RSS, we establish nearly matching upper and lower bounds on the excess risk. Our theoretical findings support the simple Noise Ignorant Empirical Risk Minimization (NI-ERM) principle, which minimizes empirical risk while ignoring label noise. Finally, we translate this theoretical insight into practice: by using NI-ERM to fit a linear classifier on top of a self-supervised feature extractor, we achieve state-of-the-art performance on the CIFAR-N data challenge.
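
As a rough illustration of the NI-ERM recipe described in the abstract (fit a linear classifier on fixed features using the noisy labels as if they were clean), here is a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the features are synthetic 2-D Gaussian blobs standing in for the output of a feature extractor, the noise is symmetric at a 30% flip rate, and `fit_softmax_regression` and `predict` are hypothetical helper names.

```python
import numpy as np

def fit_softmax_regression(features, labels, num_classes, lr=0.1, steps=300):
    """Plain empirical risk minimization (cross-entropy) for a linear classifier.

    No noise correction of any kind is applied: the labels are used as given.
    """
    n, d = features.shape
    X = np.hstack([features, np.ones((n, 1))])   # append a bias column
    W = np.zeros((d + 1, num_classes))
    Y = np.eye(num_classes)[labels]              # one-hot targets
    for _ in range(steps):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (P - Y) / n              # gradient step on mean loss
    return W

def predict(W, features):
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return (X @ W).argmax(axis=1)

# Synthetic stand-in for extracted features (assumption, not CIFAR-N).
rng = np.random.default_rng(0)
K, n = 3, 3000
angles = 2 * np.pi * np.arange(K) / K
means = 3.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
clean = rng.integers(0, K, size=n)
features = means[clean] + rng.normal(size=(n, 2))

# Symmetric label noise: with probability 0.3, swap in a different class
# chosen uniformly at random.
flip = rng.random(n) < 0.3
shift = rng.integers(1, K, size=n)
noisy = np.where(flip, (clean + shift) % K, clean)

# NI-ERM: train on the noisy labels, then compare against the clean ones.
W = fit_softmax_regression(features, noisy, K)
clean_acc = (predict(W, features) == clean).mean()
print(f"label agreement: {(noisy == clean).mean():.3f}, "
      f"accuracy vs. clean labels: {clean_acc:.3f}")
```

In this toy setting the classifier trained on noisy labels should still agree with the clean labels far more often than the noisy labels themselves do, which is the behavior the NI-ERM principle relies on. It says nothing about the paper's reported CIFAR-N numbers.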

Paper Structure

This paper contains 36 sections, 11 theorems, 129 equations, 4 figures, 8 tables.

Key Result

Proposition 1

$\mathcal{A}_{0}(\bm{\eta}, \widetilde{\bm{\eta}}) = \{x \in \mathcal{X} : \operatorname*{arg\,max} \widetilde{\bm{\eta}}(x) \subseteq \operatorname*{arg\,max} \bm{\eta}(x)\}.$
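
The set condition in Proposition 1 can be checked pointwise once the two posterior vectors at a point $x$ are known. The sketch below is only an illustration: the helper names `argmax_set` and `in_A0`, the tolerance used to detect ties, and the example vectors are all assumptions, not part of the paper.

```python
def argmax_set(probs, tol=1e-12):
    """Return the set of indices attaining the maximum (ties included)."""
    top = max(probs)
    return {i for i, p in enumerate(probs) if top - p <= tol}

def in_A0(clean_posterior, noisy_posterior):
    """True if argmax of the noisy posterior is a subset of argmax of the clean one."""
    return argmax_set(noisy_posterior) <= argmax_set(clean_posterior)

# Noise shrinks the margin but keeps the same top class: x is in A_0.
print(in_A0([0.5, 0.3, 0.2], [0.40, 0.35, 0.25]))  # True
# Noise moves the top class from 0 to 1: x is outside A_0.
print(in_A0([0.5, 0.3, 0.2], [0.3, 0.4, 0.3]))     # False
# A tie in the noisy posterior that the clean posterior does not share
# also puts x outside A_0, because {0, 1} is not a subset of {0}.
print(in_A0([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))     # False
```

The subset relation (rather than equality) matters for ties: a clean posterior tied between two classes still admits a noisy posterior that picks just one of them.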

Figures (4)

  • Figure 1: Illustration of relative signal strength for binary classification. Left: clean and noisy posteriors $[\bm{\eta}(x)]_1 = \mathbb{P}(Y = 1 \mid X = x)$ and $[\widetilde{\bm{\eta}}(x)]_1 = \mathbb{P}(\widetilde{Y} = 1 \mid X = x)$. Right: relative signal strength corresponding to these posteriors. The gray region, $x \in (0, 5)$, is where the true and noisy Bayes classifiers differ, and is also the zero signal region $\mathcal{X} \setminus \mathcal{A}_0$. The red region is $\mathcal{A}_{0.4}$, where the RSS is $> 0.4$. Note that as $x \uparrow 0$, $\mathcal{M}(x; \bm{\eta}, \widetilde{\bm{\eta}}) \uparrow \infty$, which occurs since $[\bm{\eta}(x)]_1 \uparrow 1/2$ while $[\widetilde{\bm{\eta}}(x)]_1$ is far from $1/2$. For $x = 0^+$, the predicted labels under $\bm{\eta}$ and $\widetilde{\bm{\eta}}$ disagree, and the RSS crashes to $0$.
  • Figure 2: Data simulation that verifies noise immunity. For binary classification, the turning point is at noise rate $\mathbb{P}(\widetilde{Y} \neq Y) = 0.5$. For $10$-class classification, the turning point is at $\mathbb{P}(\widetilde{Y} \neq Y) = 0.9$.
  • Figure 3: A linear model trained on features obtained from either transfer learning (ResNet-50 pretrained on ImageNet; He et al., 2016), self-supervised learning (ResNet-50 trained on CIFAR-10 images with contrastive loss; Chen et al., 2020), or a pretrained self-supervised foundation model (DINOv2; Oquab et al., 2023) significantly outperforms the original linear model. In contrast, full training of a ResNet-50 leads to overfitting.
  • Figure 4: Empirical CDF of estimated RSS for CIFAR-10N, evaluated on test data.
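
The turning points quoted in the Figure 2 caption (0.5 for binary, 0.9 for 10 classes) match what a short calculation gives under one specific assumption that the caption itself does not state: symmetric noise, where a label is flipped with probability $\rho$ to one of the other $K - 1$ classes uniformly at random. In that case each noisy posterior entry is $(1 - \rho)\,\eta_k + \frac{\rho}{K - 1}(1 - \eta_k)$, which is increasing in $\eta_k$ exactly when $\rho < (K - 1)/K$. The sketch below checks this numerically; the function names and example posteriors are hypothetical.

```python
def noisy_posterior(clean_posterior, flip_rate):
    """Noisy posterior under symmetric label noise (assumed noise model)."""
    K = len(clean_posterior)
    return [(1 - flip_rate) * p + flip_rate * (1 - p) / (K - 1)
            for p in clean_posterior]

def argmax_preserved(clean_posterior, flip_rate):
    """True if the top class is the same before and after adding noise."""
    noisy = noisy_posterior(clean_posterior, flip_rate)
    return noisy.index(max(noisy)) == clean_posterior.index(max(clean_posterior))

def immunity_threshold(num_classes):
    """Flip rate at which the noisy posterior stops preserving the argmax."""
    return (num_classes - 1) / num_classes

binary = [0.8, 0.2]
ten_class = [0.28] + [0.08] * 9

for name, clean in [("binary", binary), ("10-class", ten_class)]:
    t = immunity_threshold(len(clean))
    below = argmax_preserved(clean, t - 0.01)
    above = argmax_preserved(clean, t + 0.01)
    print(f"{name}: threshold={t}, preserved below={below}, above={above}")
```

Just below the threshold the noisy and clean top classes coincide, and just above it they do not, which is consistent with the sharp transition the caption describes.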

Theorems & Definitions (20)

  • Definition 1: Relative Signal Strength
  • Example 1
  • Example 2
  • Example 3: Comparison to KL divergence
  • Proposition 1
  • Proposition 2
  • Example 4
  • Theorem 1: Minimax Lower Bound
  • Lemma 1: Oracle Inequality
  • Theorem 2: Excess Risk Upper Bound of NI-ERM
  • ...and 10 more