Enhancing Noise-Robust Losses for Large-Scale Noisy Data Learning

Max Staats; Matthias Thamm; Bernd Rosenow

TL;DR

This work addresses learning from large-scale noisy labels by examining the early training dynamics of bounded, noise-robust losses. It identifies a growing misalignment between the logits of a freshly initialized network and the regions where these losses have non-vanishing gradients as the number of classes increases. It introduces a logit bias, which adds a real-valued constant $\epsilon$ to the correct-class logit, to restore this overlap and enable effective learning. The bias notably improves MAE (MAE*) and enables other bounded losses to perform well on WebVision without dataset- or noise-dependent hyperparameters. The authors also provide a method to compute hyperparameters from the number of classes, using the initial backpropagation error as a guide, and demonstrate that both the logit bias and this hyperparameter calibration enable state-of-the-art performance across CIFAR, Fashion-MNIST, and WebVision. Overall, the paper contributes a potentially universal, low-tuning approach to robust multiclass learning with noisy labels, with practical impact for scaling to large class counts $K$ in real-world datasets.

Abstract

Large annotated datasets inevitably contain noisy labels, which poses a major challenge for training deep neural networks as they easily memorize the labels. Noise-robust loss functions have emerged as a notable strategy to counteract this issue, but it remains challenging to create a robust loss function that is not susceptible to underfitting. Through a quantitative approach, this paper explores the limited overlap between the network output at initialization and regions of non-vanishing gradients of bounded loss functions in the initial learning phase. Using these insights, we address underfitting of several noise-robust losses with a novel method denoted as logit bias, which adds a real number $\epsilon$ to the logit at the position of the correct class. The logit bias enables these losses to achieve state-of-the-art results, even on datasets like WebVision, consisting of over a million images from 1000 classes. In addition, we demonstrate that our method can be used to determine optimal parameters for several loss functions without having to train networks. Remarkably, our method determines the hyperparameters based on the number of classes, resulting in loss functions that require no dataset- or noise-dependent parameters.
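The effect of the logit bias on the initial gradient can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not taken from the paper: all-zero logits stand in for a freshly initialized network, MAE is taken as $\sum_i |p_i - y_i| = 2(1 - p_k)$ on softmax outputs, the function names are hypothetical, and the value $\epsilon = \ln(K-1)$ is chosen only for illustration rather than being the value the authors use.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def add_logit_bias(logits, labels, eps):
    # Return a copy of the logits with eps added at each sample's labeled class.
    # (Hypothetical helper; the label is the given, possibly noisy, one.)
    biased = np.array(logits, dtype=float)
    biased[np.arange(len(labels)), labels] += eps
    return biased

def mae_error_correct_class(logits, labels):
    # dL/dz_k for L = 2 * (1 - p_k), which equals -2 * p_k * (1 - p_k).
    p_k = softmax(logits)[np.arange(len(labels)), labels]
    return -2.0 * p_k * (1.0 - p_k)

# Illustration with K = 1000 classes and all-zero logits (an assumption).
K = 1000
logits = np.zeros((4, K))
labels = np.array([0, 5, 10, 999])

plain = mae_error_correct_class(logits, labels)
biased = mae_error_correct_class(
    add_logit_bias(logits, labels, eps=np.log(K - 1)), labels
)
print("error without bias:", plain[0])
print("error with bias:   ", biased[0])
```

Under this zero-logit simplification the error magnitude $2 p_k (1 - p_k)$ peaks at $p_k = 1/2$, that is at $z_k = \ln(K-1)$: roughly 2.2 for 10 classes, 4.6 for 100, and 6.9 for 1000. The peak therefore moves away from the near-zero initial logits as $K$ grows, and a bias on the labeled class moves the logit back toward it.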

Paper Structure (15 sections, 7 equations, 6 figures, 6 tables)

Introduction
Related Work
Theoretical Considerations
Comparison of various loss functions
Boosting learning with example-dependent logit bias
Calculation of hyperparameters
Empirical Results
Conclusion
Appendix
Training Parameters
Choosing the right logit bias
Empirical output of a freshly initialized neural network
Estimating hyperparameters of other loss functions
Empirical results on a different architecture
Backpropagation error of bounded losses

Figures (6)

Figure 1: Analysis of the average error for the neuron corresponding to the correct label, determined for the MAE loss. Increasing the number of classes from 10 to 100 leads to a notable shift in the range where the average error $\langle \partial_{z_k} \mathcal{L} \rangle$ is non-vanishing. However, the logit distribution of a newly initialized network (shown in the histogram at the bottom) remains class-count invariant. This mismatch leads to diminished gradients, stalling the learning due to tiny errors.
Figure 2: The average error $\langle \delta_k \rangle$ of the final layer's correct neuron $k$, a determining factor for the magnitude of a gradient descent update, plotted against the pre-activation $z_k$ for a network at initialization. Panel (a) depicts a ten-class learning scenario, showing a pronounced overlap between the region where learning is possible (characterized by large negative $\langle \delta_k \rangle$) and the logit distribution of a randomly initialized network (blue histogram). In contrast, in the scenario with 1000 classes shown in panel (b), this overlap is absent for the bounded loss functions. Rather than employing a tailed loss, exemplified by biTemp (dashed green line), our approach shifts the $z_k$ distribution into the range where learning is possible by adding a bias to the correct neuron's pre-activation (green histogram). The average error of NCE-AGCE is rescaled by a factor of $0.5$ in panel (a) to enhance visual clarity.
Figure 3: Computation of the hyperparameter $q$ for genCE when changing from $100$ to $1000$ classes. Instead of performing an expensive hyperparameter search, we suggest adjusting the hyperparameter such that $\langle\delta_k \rangle$ at position $z_k=0$ is unchanged for the larger number of classes.
Figure 4: The logit bias value $\epsilon$ for the Mean Absolute Error as a function of the number of classes present in the dataset.
Figure 5: Logit distribution $p_z$ for (a) the fully connected network MLP1024 and (b) the ResNet-34 architecture. The histogram displays the empirical logits for different network initializations given the first images of the CIFAR dataset. The green curve displays a normal distribution with a mean of zero and a variance set to the variance of the logit distribution.
...and 1 more figures
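The hyperparameter matching described in the Figure 3 caption can be sketched in a few lines of Python. This is a simplified illustration, not the authors' procedure: it assumes genCE denotes the generalized cross entropy $(1 - p_k^q)/q$, it sets all other logits to zero instead of averaging over the logit distribution of an initialized network, and the function names and the reference value $q = 0.7$ are hypothetical.

```python
import math

def p_correct(z_k, num_classes):
    # Softmax probability of the correct class when all other logits are 0.
    return 1.0 / (1.0 + (num_classes - 1) * math.exp(-z_k))

def gce_error(z_k, q, num_classes):
    # dL/dz_k for L = (1 - p_k**q) / q, which equals -p_k**q * (1 - p_k).
    p = p_correct(z_k, num_classes)
    return -(p ** q) * (1.0 - p)

def match_q(q_ref, k_ref, k_new):
    # Find q for k_new classes so that the error at z_k = 0 equals the
    # error obtained with q_ref at k_ref classes.
    target = -gce_error(0.0, q_ref, k_ref)
    # At z_k = 0 we have p = 1/K, so target = K**(-q) * (1 - 1/K).
    # Taking logarithms and solving for q gives:
    return (math.log(1.0 - 1.0 / k_new) - math.log(target)) / math.log(k_new)

# Hypothetical reference: q = 0.7 at 100 classes, transferred to 1000 classes.
q_new = match_q(q_ref=0.7, k_ref=100, k_new=1000)
print("matched q:", q_new)
print("error at 100 classes: ", gce_error(0.0, 0.7, 100))
print("error at 1000 classes:", gce_error(0.0, q_new, 1000))
```

In this simplified setting the matched $q$ is smaller than the reference value, because a smaller $q$ compensates for the smaller initial probability $1/K$ at the larger class count. No network has to be trained to obtain it.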
