Label Noise: Correcting the Forward-Correction
William Toner, Amos Storkey
TL;DR
This paper addresses learning under label noise. It highlights that forward-corrected losses, while consistent under a known noise model, can still overfit finite noisy datasets. It introduces a principled loss-bounding strategy: a noise bound $B(\eta,c)$ is derived from the average entropy of the noisy class posteriors under a separability assumption, and the training loss is kept from falling below this bound. The work generalises forward-correction to non-linear noise models and interprets popular robust losses (e.g., GCE, SCE) as generalised forward-corrected losses, giving them a unified interpretation. Empirically, the proposed noise-bounded loss improves robustness across diverse datasets and noise types with minimal overhead, though performance depends on accurate noise-rate estimation and on the validity of the separability assumption.
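To make the entropy-based bound concrete, the sketch below computes the entropy of a noisy class posterior in one simple case. It is an illustration under stated assumptions, not the paper's definition of $B(\eta,c)$: it assumes $\eta$ is the noise rate, $c$ is the number of classes, the noise is symmetric (a corrupted label is equally likely to be any of the other $c-1$ classes), and the clean classes are separable, so the noisy posterior puts mass $1-\eta$ on the true class and $\eta/(c-1)$ on each other class. The function name `symmetric_noise_bound` is hypothetical.

```python
import math


def symmetric_noise_bound(eta: float, c: int) -> float:
    """Entropy (in nats) of a noisy posterior under ASSUMED symmetric noise.

    Assumption: the clean label is deterministic given the input
    (separability), it is kept with probability 1 - eta, and otherwise it is
    flipped uniformly to one of the other c - 1 classes.
    """
    if c < 2:
        raise ValueError("need at least two classes")
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    if eta == 0.0:
        # No noise: the posterior is one-hot, so its entropy is zero.
        return 0.0
    # -(1 - eta) log(1 - eta) - (c - 1) * (eta / (c - 1)) log(eta / (c - 1))
    return -(1.0 - eta) * math.log(1.0 - eta) - eta * math.log(eta / (c - 1))
```

Under these assumptions the value is zero for clean labels and grows with the noise rate until the noisy labels are uniform, where it equals $\log c$. No cross-entropy loss evaluated on the noisy distribution can have an expected value below this entropy, which is why a training loss far beneath it suggests memorised noise.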
Abstract
Training neural network classifiers on datasets with label noise poses a risk of overfitting them to the noisy labels. To address this issue, researchers have explored alternative loss functions that aim to be more robust. The 'forward-correction' is a popular approach wherein the model outputs are noised before being evaluated against noisy data. When the true noise model is known, applying the forward-correction guarantees consistency of the learning algorithm. While providing some benefit, the correction is insufficient to prevent overfitting to finite noisy datasets. In this work, we propose an approach to tackling overfitting caused by label noise. We observe that the presence of label noise implies a lower bound on the noisy generalised risk. Motivated by this observation, we propose imposing a lower bound on the training loss to mitigate overfitting. Our main contribution is providing theoretical insights that allow us to approximate the lower bound given only an estimate of the average noise rate. We empirically demonstrate that using this bound significantly enhances robustness in various settings, with virtually no additional computational cost.
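The two ideas in the abstract can be sketched in a few lines of plain Python. The first function applies the forward-correction: the model's clean-class probabilities are pushed through a noise transition matrix before the negative log-likelihood is taken against the noisy label. The second shows one possible way to impose a lower bound on the training loss. The abstract does not say how the bound is enforced, so the reflection $|L - B| + B$ used here is an assumption borrowed from the "flooding" regulariser, and the names `forward_corrected_nll`, `bounded_loss`, and `T` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_corrected_nll(logits, noisy_label, T):
    """Negative log-likelihood of a noisy label after forward-correction.

    T[i][j] is the (assumed known) probability that clean class i is
    observed as noisy class j, so each row of T sums to one.
    """
    clean_probs = softmax(logits)
    # Noise the model output: p(noisy = j) = sum_i p(clean = i) * T[i][j].
    noisy_prob = sum(p * T[i][noisy_label] for i, p in enumerate(clean_probs))
    return -math.log(noisy_prob)


def bounded_loss(batch_loss, bound):
    """ASSUMED enforcement of the lower bound (flooding-style reflection).

    Above the bound the loss is unchanged; below it the value is mirrored,
    so a gradient step would push the loss back up towards the bound.
    """
    return abs(batch_loss - bound) + bound
```

With an identity matrix for `T` the first function reduces to ordinary cross-entropy. With a noisy `T` the per-example loss cannot drop below the negative log of the largest entry in the relevant column, which illustrates how the correction limits, but does not by itself prevent, fitting individual noisy labels.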
