Hard Samples, Bad Labels: Robust Loss Functions That Know When to Back Off

Nicholas Pellegrino, David Szczecina, Paul Fieguth

TL;DR

This paper tackles pervasive label noise in supervised learning by introducing two robust loss functions, Blurry Loss (BL) and Piecewise-zero Loss (PZ), that down-weight or ignore samples likely to be mislabeled. The authors formalize the problem, situate their approach within existing label-error detection frameworks (Confident Learning, CL, and Area Under the Margin, AUM), and evaluate it empirically against standard and state-of-the-art robust losses. Across multiple datasets and corruption regimes (including non-uniform real-world noise), BL and PZ consistently improve label-error detection performance, measured by F1 and Balanced Accuracy, aided by a loss scheduling strategy that preserves learning on clean data. The results suggest these losses are practical, broadly applicable tools for data curation and robust model training when labels are imperfect, with implications for data quality and reliability in real-world AI systems.

Abstract

Incorrectly labelled training data are frustratingly ubiquitous in both benchmark and specially curated datasets. Such mislabelling adversely affects the performance and generalizability of models trained through supervised learning on these datasets. Frameworks for detecting label errors typically require well-trained, well-generalized models; however, most frameworks train these models on the corrupt data itself, which reduces model generalizability and, in turn, the effectiveness of error detection, unless a training scheme robust to label errors is employed. We evaluate two novel loss functions, Blurry Loss and Piecewise-zero Loss, that enhance robustness to label errors by de-weighting or disregarding difficult-to-classify samples, which are likely to be erroneous. These loss functions leverage the idea that mislabelled examples are typically more difficult to classify and should contribute less to the learning signal. Comprehensive experiments on a variety of artificially corrupted datasets demonstrate that the proposed loss functions outperform state-of-the-art robust loss functions in nearly all cases, achieving superior F1 scores for error detection. Further ablation studies support these loss functions' broad applicability to both uniform and non-uniform corruption, and to different label error detection frameworks. By using these robust loss functions, machine learning practitioners can more effectively identify, prune, or correct errors in their training data.
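
The idea of disregarding difficult-to-classify samples can be illustrated with a small sketch. The abstract does not give the definition of either proposed loss, so the functional form below is an assumption made for illustration only, not the authors' Piecewise-zero Loss: it applies ordinary cross-entropy, and once a warm-up of `delay` epochs has passed it assigns zero loss to any sample whose predicted probability for its given label falls below `cutoff`. The function names and the exact thresholding rule are hypothetical.

```python
import numpy as np


def cross_entropy(p_label):
    """Per-sample cross-entropy, given the predicted probability of the assigned label."""
    return -np.log(np.clip(p_label, 1e-12, 1.0))


def piecewise_zero_sketch(p_label, cutoff, epoch, delay):
    """Illustrative loss that ignores hard samples (ASSUMED form, not the paper's definition).

    p_label : predicted probabilities of each sample's assigned (possibly wrong) label
    cutoff  : samples with p_label below this value get zero loss
    epoch   : current training epoch (0-based)
    delay   : number of initial epochs trained with plain cross-entropy
    """
    p_label = np.asarray(p_label, dtype=float)
    loss = cross_entropy(p_label)
    if epoch < delay:
        # Warm-up phase: every sample contributes, so the model can first learn from all data.
        return loss
    # After the warm-up: hard samples (low p_label) contribute nothing to the loss.
    return np.where(p_label < cutoff, 0.0, loss)


# Toy batch: one easy sample and one hard sample (made-up values).
p = np.array([0.9, 0.05])
during_warmup = piecewise_zero_sketch(p, cutoff=0.1, epoch=0, delay=1)
after_warmup = piecewise_zero_sketch(p, cutoff=0.1, epoch=2, delay=1)
```

In this toy batch the hard sample has a large loss during the warm-up and exactly zero loss afterwards, while the easy sample is treated identically in both phases. A loss that is constant (zero) in a region also has zero gradient there, which is the sense in which such samples stop influencing training.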

Paper Structure

This paper contains 34 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The two proposed loss functions: Blurry Loss (\ref{fig:BL_loss_graphic}) and Piecewise-zero Loss (\ref{fig:PZ_loss_graphic}).
  • Figure 2: Predicted probability (top) and gradient (bottom) distributions per epoch, for CIFAR-100 at $\eta=0.4$. With all loss functions, the predicted probabilities of corrupt data (red) are lower than those of clean data (green). With cross-entropy (CE), the gradients of corrupt data are large and negative, giving the optimizer a strong signal to fit these corrupt data. With BL, gradients of corrupt data are large and positive, steering the model away from them, and with PZ, gradients of corrupt data are nearly all zero, so they have essentially no effect on training. A version of this figure without truncated vertical limits can be found in \ref{fig:gradients_Full} of \ref{subsec:appendix_expanded}.
  • Figure 3: FL (dotted lines) results in better F1 scores than BL (\ref{fig:lowCR_Prec_Rec_F1_BL}) or PZ (\ref{fig:lowCR_Prec_Rec_F1_PZ}) (solid lines) at $\eta=0.1$ on the CIFAR-100 dataset. This is a result of conservative detection, whereby fewer samples are detected and precision is maintained. By tuning the parameters of the proposed loss functions, performance can be made fairly similar to that of FL, or recall can be increased dramatically while only marginally reducing precision, which may be considered worthwhile at low corruption rates.
  • Figure 4: Heatmap showing variation in F1 score as loss function parameters are varied, for MNIST (\ref{fig:parameters_vs_cr_MNIST}) and CIFAR-100 (\ref{fig:parameters_vs_cr_CIFAR100}). In the left blocks, separated by a white line, results for baseline loss functions are shown. Two separate blocks are shown for the proposed loss functions, plotted over a range in their parameters. Best results are indicated with bolded yellow text. To illustrate the trend in optimal parameter settings for the proposed loss functions, yellow boxes mark the best performing parameter setting for each row. Note that delay, $d$, is set to 0 for BL and 1 for PZ loss.
  • Figure 5: Detection F1 score vs. delay, $d$, for PZ loss on CIFAR-100 at $\eta$ ranging from $0.1$ to $0.4$ in panels (\ref{fig:pz_delay_cr1}) to (\ref{fig:pz_delay_cr4}). Observe that the best performance comes at $d=1$ in all cases, for an appropriate setting of the cutoff, $c$.
  • ...and 3 more figures
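
Several of the captions above discuss precision, recall, and F1 for label-error detection. As a reminder of how these quantities relate, the sketch below computes them, together with Balanced Accuracy, from a ground-truth corruption mask and a detector's flags. The helper name `detection_metrics` and the toy masks are assumptions for illustration; they are not taken from the paper or its data.

```python
import numpy as np


def detection_metrics(is_corrupt, flagged):
    """Precision, recall, F1 and balanced accuracy for label-error detection.

    is_corrupt : boolean mask, True where a label really is wrong
    flagged    : boolean mask, True where the detector reports an error
    """
    is_corrupt = np.asarray(is_corrupt, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)

    tp = int(np.sum(flagged & is_corrupt))    # corrupt and flagged
    fp = int(np.sum(flagged & ~is_corrupt))   # clean but flagged
    fn = int(np.sum(~flagged & is_corrupt))   # corrupt but missed
    tn = int(np.sum(~flagged & ~is_corrupt))  # clean and not flagged

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    balanced_accuracy = 0.5 * (recall + specificity)

    return {"precision": precision, "recall": recall, "f1": f1,
            "balanced_accuracy": balanced_accuracy}


# Toy example (made-up masks): 4 corrupt labels out of 10 samples.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
conservative = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # flags little, never wrong
aggressive = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]    # flags more, one false alarm

m_conservative = detection_metrics(truth, conservative)
m_aggressive = detection_metrics(truth, aggressive)
```

In this toy case the conservative detector keeps precision at 1.0 but recovers only a quarter of the errors, while the aggressive one gives up some precision for much higher recall. Which of the two yields the better F1 depends on the data and the detector, so the toy numbers say nothing about the paper's results.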