Noisy Early Stopping for Noisy Labels

William Toner, Amos Storkey

TL;DR

It is established that, in many typical learning environments, a noise-free validation set is not necessary for effective Early Stopping, and that near-optimal results can be achieved by monitoring accuracy on a noisy dataset drawn from the same distribution as the noisy training set.

Abstract

Training neural network classifiers on datasets contaminated with noisy labels significantly increases the risk of overfitting. Thus, effectively implementing Early Stopping in noisy label environments is crucial. Under ideal circumstances, Early Stopping utilises a validation set uncorrupted by label noise to monitor generalisation during training. However, obtaining a noise-free validation dataset can be costly and challenging. This study establishes that, in many typical learning environments, a noise-free validation set is not necessary for effective Early Stopping. Instead, near-optimal results can be achieved by monitoring accuracy on a noisy dataset drawn from the same distribution as the noisy training set. Referred to as 'Noisy Early Stopping' (NES), this method simplifies and reduces the cost of implementing Early Stopping. We provide theoretical insights into the conditions under which this method is effective and empirically demonstrate its robust performance across standard benchmarks using common loss functions.
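
As a rough illustration of the stopping rule described in the abstract, the sketch below picks the epoch with the highest accuracy on a noisy validation set. It is a minimal sketch, not the paper's algorithm: the function name `noisy_early_stopping_epoch` and the `patience` parameter are assumptions introduced here, and the per-epoch accuracies are taken as already computed on held-out data carrying the same label noise as the training set.

```python
from typing import Sequence


def noisy_early_stopping_epoch(noisy_val_acc: Sequence[float], patience: int = 5) -> int:
    """Return the index of the epoch selected by monitoring noisy validation accuracy.

    Hypothetical helper: `patience` (an assumption, not from the paper) is the
    number of consecutive non-improving epochs tolerated before stopping.
    """
    if not noisy_val_acc:
        raise ValueError("need at least one epoch of validation accuracy")

    best_epoch, best_acc, waited = 0, noisy_val_acc[0], 0
    for epoch in range(1, len(noisy_val_acc)):
        acc = noisy_val_acc[epoch]
        if acc > best_acc:
            # New best noisy accuracy: remember this epoch and reset the counter.
            best_epoch, best_acc, waited = epoch, acc, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Illustrative (made-up) accuracies measured on the noisy validation set.
history = [0.40, 0.52, 0.58, 0.57, 0.55, 0.54]
print(noisy_early_stopping_epoch(history, patience=2))
```

In a real training loop the model weights from the returned epoch would be restored; no clean labels are consulted at any point.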


Paper Structure

This paper contains 64 sections, 6 theorems, 70 equations, 14 figures, 1 table, and 1 algorithm.

Key Result

Theorem C.1

Let $p(x,y)$ be a data-label distribution and suppose that $\widetilde{p}(x,y)$ is a noisy version corrupted by class-preserving label noise. A probability estimator $\bm{q}^{*}$ is Bayes-optimal for the noisy distribution if and only if it is Bayes-optimal for the clean distribution. Equivalently;

Figures (14)

  • Figure 1: Illustration of the difference between noisy and clean accuracy and the non-trivial relationship between them: The figure depicts eight images from a web-scraped chihuahua dataset. The dataset contains label noise because the scrape accidentally included images of muffins [muffin_chihuahua_dataset]. A classifier that correctly identifies all of the true chihuahuas (red) will obtain a noisy accuracy of $\frac{4}{8} = 50\%$ on this dataset despite obtaining a clean accuracy of $100\%$.
  • Figure 2: Symmetric label noise - noisy vs clean accuracy: A classifier model is trained using cross-entropy loss on the MNIST dataset, corrupted by 36% symmetric label noise. We plot the model's noisy and clean accuracies against each other at the end of each epoch, with early epochs coloured in dark blue and later epochs (around 100) in yellow. As expected, a linear relationship emerges between the noisy and clean accuracies. The theoretical relationship (Equation \ref{eqn:sym_risk_relationship}) is depicted by the black line, showing near-perfect alignment between the experimental results and theoretical predictions.
  • Figure 3: Noisy validation accuracy plotted against clean validation accuracy: Left: Decision Tree Classifier at increasing depths, fitted to a dataset containing $42\%$ pairwise label noise. Shallow depth models are represented in blue, transitioning to yellow as depth increases. A red dotted vertical line highlights the classifier achieving the highest noisy validation accuracy, while a blue dotted vertical line marks the classifier with the highest clean validation accuracy. The significant horizontal gap between these lines illustrates the limited effectiveness of Noisy Early Stopping (NES) for optimising the depth of decision trees under this type of label noise. Right: Neural Network Classifier trained on the same noisy dataset. Early epochs are represented in blue, transitioning to yellow. A red dotted vertical line highlights the epoch with the highest noisy validation accuracy, while a blue dotted vertical line marks the epoch with the highest clean validation accuracy. The small horizontal gap between these lines illustrates the effectiveness of Noisy Early Stopping (NES) for neural network models under similar noise conditions. For both graphs the light blue region represents bounds established by Fact 5, indicating that no model may achieve an accuracy/noisy-accuracy combination outside this region.
  • Figure 4: Comparison of NES (blue) and clean Early Stopping (ES) (yellow) across increasing noise rates ($\eta$). The plots show final clean test accuracy for models trained on the asymmetrically-noised MNIST dataset (left) and symmetrically-noised Fashion-MNIST dataset (right) when implementing each Early Stopping policy. Vertical lines at $\eta = 0.33$ for MNIST and $\eta = 0.9$ for Fashion-MNIST mark the thresholds where noise ceases to be class-preserving. Below these thresholds, NES closely aligns with ES, demonstrating its robustness under varying levels of label noise.
  • Figure 5: The top figure shows the clean test accuracy (blue) and noisy validation accuracy (yellow) during training on the symmetrically-noised Fashion-MNIST dataset ($\eta=0.2$). We highlight the maximum clean accuracy achieved (red cross) and compare this with the accuracy achieved by Early Stopping using the noisy validation set (green plus); the difference in clean test accuracy between these is a mere $0.38\%$. The second figure from the top repeats this for the asymmetrically-noised CIFAR10 data, where the difference is again small, at $0.55\%$. The bottom two figures provide the same information as the top two but are expressed differently (Fashion-MNIST on the left and MNIST on the right). We plot the clean test accuracy against the noisy validation accuracy during training. Each epoch is coloured so that the first few epochs are blue and the final epochs are yellow, with the hue shifting gradually from blue to yellow through red. Initially, both the noisy and clean accuracy increase, moving into the upper right corner of the graph before overfitting occurs, leading to a decline in both accuracies. For both datasets, the noisy and clean accuracies are maximised approximately simultaneously.
  • ...and 9 more figures
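
The linear relationship described in the Figure 2 caption can be checked numerically with a small simulation. The sketch below is an illustration under stated assumptions, not a reproduction of the paper's equation: it assumes symmetric noise flips a label with probability `eta` to one of the other `num_classes - 1` classes chosen uniformly, and that the noise is independent of the classifier's prediction. Under those assumptions a classifier with clean accuracy $A$ has expected noisy accuracy $(1-\eta)A + \frac{\eta}{c-1}(1-A)$. The function names are hypothetical.

```python
import random


def expected_noisy_accuracy(clean_acc: float, eta: float, num_classes: int) -> float:
    """Closed form under the assumed symmetric-noise model (not taken from the paper)."""
    # Correct predictions survive with prob. 1 - eta; a wrong prediction matches
    # the flipped label only if the flip lands on that one class.
    return (1.0 - eta) * clean_acc + eta / (num_classes - 1) * (1.0 - clean_acc)


def simulate_noisy_accuracy(clean_acc: float, eta: float, num_classes: int,
                            n: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of accuracy measured against symmetrically-noised labels."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        clean = rng.randrange(num_classes)
        # Prediction is right with prob. clean_acc, otherwise a uniformly chosen wrong class.
        if rng.random() < clean_acc:
            pred = clean
        else:
            pred = (clean + rng.randrange(1, num_classes)) % num_classes
        # Label is flipped with prob. eta to a uniformly chosen other class.
        if rng.random() < eta:
            noisy = (clean + rng.randrange(1, num_classes)) % num_classes
        else:
            noisy = clean
        hits += pred == noisy
    return hits / n


# Settings echoing Figure 2: 10 classes, 36% symmetric noise; clean accuracy 0.9 is made up.
print(expected_noisy_accuracy(0.9, 0.36, 10))
print(simulate_noisy_accuracy(0.9, 0.36, 10))
```

Under this model the slope $1 - \frac{\eta c}{c-1}$ stays positive while $\eta < \frac{c-1}{c}$, which for ten classes gives $0.9$, the same value as the Fashion-MNIST threshold quoted in the Figure 4 caption.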

Theorems & Definitions (11)

  • Definition 2.1: Class-Preserving Label Noise
  • Theorem C.1
  • proof
  • Lemma C.2
  • proof
  • Theorem C.3
  • proof
  • Corollary C.4
  • Lemma C.5
  • proof
  • ...and 1 more