Label Noise: Correcting the Forward-Correction

William Toner, Amos Storkey

TL;DR

This paper addresses learning under label noise. It highlights that forward-corrected losses, while consistent under a known noise model, can still overfit finite noisy datasets. It introduces a principled loss-bounding strategy: a noise-bound $B(\eta,c)$ is derived from the average entropy of the noisy class posteriors under a separability assumption, and is then enforced as a lower bound on the training loss. The work generalises forward-correction to non-linear noise models and recasts popular robust losses (e.g., GCE, SCE) as generalised forward-corrected losses, unifying their interpretation. Empirically, the proposed noise-bounded loss improves robustness across diverse datasets and noise types with minimal overhead, though performance depends on accurate noise-rate estimation and on the validity of the separability assumption.

Abstract

Training neural network classifiers on datasets with label noise poses a risk of overfitting them to the noisy labels. To address this issue, researchers have explored alternative loss functions that aim to be more robust. The 'forward-correction' is a popular approach wherein the model outputs are noised before being evaluated against noisy data. When the true noise model is known, applying the forward-correction guarantees consistency of the learning algorithm. While providing some benefit, the correction is insufficient to prevent overfitting to finite noisy datasets. In this work, we propose an approach to tackling overfitting caused by label noise. We observe that the presence of label noise implies a lower bound on the noisy generalised risk. Motivated by this observation, we propose imposing a lower bound on the training loss to mitigate overfitting. Our main contribution is providing theoretical insights that allow us to approximate the lower bound given only an estimate of the average noise rate. We empirically demonstrate that using this bound significantly enhances robustness in various settings, with virtually no additional computational cost.
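
The lower bound described above can be illustrated with a minimal sketch. It assumes symmetric label noise with rate eta over c classes and perfectly separable clean classes, so that every noisy class posterior has entropy $-(1-\eta)\log(1-\eta) - \eta\log(\eta/(c-1))$; the expected cross-entropy on noisy labels cannot fall below this value. The function names and the flooding-style form $|L - B| + B$ are assumptions made for illustration and may differ from the paper's exact definitions.

```python
import numpy as np


def symmetric_noise_entropy(eta: float, c: int) -> float:
    """Entropy (in nats) of a noisy label whose clean label is certain and is
    flipped uniformly to one of the other c - 1 classes with probability eta.

    Assumption: symmetric noise and separable classes (illustrative only).
    """
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    if eta == 0.0:
        return 0.0  # no noise, so the noisy label is deterministic
    return float(-(1 - eta) * np.log(1 - eta) - eta * np.log(eta / (c - 1)))


def bounded_loss(batch_loss: float, bound: float) -> float:
    """Assumed flooding-style bound: the value never drops below `bound`,
    and a loss below the bound is reflected back above it."""
    return abs(batch_loss - bound) + bound


# Example: 10 classes with an estimated average noise rate of 40%.
B = symmetric_noise_entropy(0.4, 10)
loss_when_overfitting = bounded_loss(0.1, B)  # reflected above B
loss_when_above_bound = bounded_loss(B + 1.0, B)  # unchanged
```

In a training loop, the reflected value would be the quantity that is backpropagated, so gradient steps push the loss back up once it falls below the bound.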

Paper Structure

This paper contains 41 sections, 13 theorems, 53 equations, 4 figures, 8 tables, 1 algorithm.

Key Result

Lemma 2.2

The GCE, SCE and forward-corrected CE (denoted FCE) loss functions can be formulated as generalised forward-corrected losses with a proper base loss. The noise models $f_{GCE}, f_{SCE}, f_{FCE}$ satisfy [equations missing from this extract], where $\widehat{T}$ is the invertible stochastic matrix used to define the correction, and $\lambda$ is a constant selected to ensure the correct normalisation.
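
As a concrete instance of a forward-corrected loss, the sketch below passes the model's predicted class probabilities through a transition matrix before taking the cross-entropy against the noisy label. The convention T[i, j] = P(noisy = j | clean = i), the symmetric noise matrix, and all function names are assumptions made for illustration; the paper's $\widehat{T}$ may follow the transposed convention.

```python
import numpy as np


def symmetric_transition(eta: float, c: int) -> np.ndarray:
    """Row-stochastic matrix with T[i, j] = P(noisy = j | clean = i) under
    symmetric noise (assumed noise model, for illustration)."""
    T = np.full((c, c), eta / (c - 1))
    np.fill_diagonal(T, 1.0 - eta)
    return T


def forward_corrected_ce(probs: np.ndarray, noisy_labels: np.ndarray,
                         T: np.ndarray) -> float:
    """Cross-entropy between the *noised* predictions and the noisy labels.

    probs: (n, c) predicted clean-class probabilities.
    noisy_labels: (n,) integer noisy labels.
    """
    noisy_probs = probs @ T  # predicted distribution over noisy labels
    picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
    return float(-np.mean(np.log(picked)))


# A fully confident prediction no longer drives the loss to zero:
T = symmetric_transition(0.2, 3)
probs = np.array([[1.0, 0.0, 0.0]])
loss = forward_corrected_ce(probs, np.array([0]), T)  # equals -log(0.8)
```

With eta = 0 the matrix is the identity and the function reduces to ordinary cross-entropy.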

Figures (4)

  • Figure 1: CE, FCE, and our proposed FCE$+$B loss functions. The forward-correction ensures consistency, while the application of a bound (FCE$+$B) mitigates overfitting.
  • Figure 2: Performance as a function of the estimated noise rate used to compute the noise-bound. We plot the final (clean) validation accuracy of a model against the estimated noise rate used to compute the noise-bound (Eqn. \ref{eqn:noise_bound}) on the noisy CIFAR10/EMNIST datasets using SCE/CE losses respectively. The noise-bound computed with the true noise rate is highlighted by the green dotted line; both graphs show a bump with a peak near this line, demonstrating that underestimating the noise rate causes overfitting while overestimating causes underfitting. Most crucially, the prominent 'bump' reinforces that robustness can be greatly improved by training with a well-selected bound.
  • Figure 3: On the top row, we plot the upper and lower limits of $A(\eta, c)$ for $\eta\in (0, 0.5]$ from Corollary \ref{cor:uniform_interval} for the CE (red), SCE (yellow) and GCE (blue) losses for 10 classes (left) and 200 classes (right). On the bottom row, we plot the ratio of these upper and lower limits instead. We observe that the difference between the upper and lower limits is far greater for CE than for the other losses, and that this is more pronounced with more classes.
  • Figure 4: Plot of $f^{-1}(p)$ for SCE ($A=8$), GCE ($a=0.7$), FCE ($\eta = 0.4$), CE and MAE in the binary case. The true probability $p$ is on the $x$-axis, and the choice of $q$ that minimises the expected loss is on the $y$-axis.
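
The quantity plotted in Figure 4 can be checked numerically for the FCE case. The sketch below assumes binary classification with symmetric noise of rate eta, so that a prediction $q$ is noised to $r = \eta + (1-2\eta)q$; the expected cross-entropy is then minimised where $r = p$, giving $q = (p-\eta)/(1-2\eta)$ clipped to $[0, 1]$. The function names and the grid search are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def fce_minimiser(p: float, eta: float, grid: int = 10001) -> float:
    """Grid search for the q in [0, 1] that minimises the expected
    forward-corrected CE when the noisy class-1 probability is p.

    Assumes binary labels and symmetric noise with 0 < eta < 0.5.
    """
    q = np.linspace(0.0, 1.0, grid)
    r = eta + (1.0 - 2.0 * eta) * q  # noised prediction for class 1
    risk = -(p * np.log(r) + (1.0 - p) * np.log(1.0 - r))
    return float(q[np.argmin(risk)])


def fce_closed_form(p: float, eta: float) -> float:
    """Closed-form minimiser under the same assumptions."""
    return float(np.clip((p - eta) / (1.0 - 2.0 * eta), 0.0, 1.0))


# With eta = 0.4 (the value used for FCE in Figure 4):
q_mid = fce_minimiser(0.45, 0.4)   # interior minimiser
q_high = fce_minimiser(0.90, 0.4)  # p outside [eta, 1 - eta], so q saturates
```

Noisy probabilities outside $[\eta, 1-\eta]$ cannot be matched by any noised prediction, which is why the minimiser saturates at 0 or 1 there.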

Theorems & Definitions (27)

  • Definition 2.1: Generalised Forward-Correction
  • Lemma 2.2
  • Definition 3.1: Bounded Loss
  • Definition 4.1: Entropy Function
  • Lemma 4.2
  • Lemma 4.3
  • Corollary 4.4
  • Definition 4.5: Noise-Bound
  • Definition 4.6: Noise-Bounded Loss
  • Theorem C.1: Savage's Theorem
  • ...and 17 more