Towards Understanding Why Label Smoothing Degrades Selective Classification and How to Fix It

Guoxuan Xia, Olivier Laurent, Gianni Franchi, Christos-Savvas Bouganis

TL;DR

The paper investigates why label smoothing (LS) degrades selective classification (SC) and shows this occurs across diverse architectures and tasks, including ImageNet and Cityscapes. Through a gradient-based analysis, it reveals that LS imposes an imbalanced suppression of the max logit: it more strongly dampens logits when a prediction is likely correct and less so when it is likely wrong, flattening the uncertainty gap between correct and incorrect predictions. This degrades the ranking used by uncertainty-based rejection (SC), especially at low coverage. The authors demonstrate that post-hoc logit normalisation can effectively recover SC performance for LS-trained models by reversing this suppression pattern, with the effectiveness explained by the same gradient-based mechanism. The work provides practical guidance for deployment and points to broader implications for training-time label augmentation and uncertainty estimation in safety-critical settings.
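To make the idea of post-hoc logit normalisation concrete, here is a minimal sketch of one plausible form: the max logit divided by the p-norm of the mean-centred logits. The exact normalisation and the choice of `p` used in the paper are not stated in this summary, so the function name, the centring step, and the default `p=2` are all assumptions made for illustration. The point of the sketch is only that such a score ignores the overall scale of the logits, whereas the maximum softmax probability (MSP) does not.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_uncertainty(logits):
    """Baseline uncertainty: one minus the maximum softmax probability."""
    return 1.0 - max(softmax(logits))

def normalised_max_logit_uncertainty(logits, p=2.0):
    """Hypothetical normalised score: -max(centred logits) / ||centred logits||_p.

    Assumed form for illustration only; not taken from the paper's text.
    """
    mean = sum(logits) / len(logits)
    centred = [v - mean for v in logits]
    norm = sum(abs(c) ** p for c in centred) ** (1.0 / p)
    if norm == 0.0:
        # All logits equal: return the largest value this score can take.
        return 0.0
    return -max(centred) / norm

# Doubling every logit changes the MSP but not the normalised score.
a = [4.0, 1.0, 0.0]
b = [8.0, 2.0, 0.0]
print(msp_uncertainty(a), msp_uncertainty(b))
print(normalised_max_logit_uncertainty(a), normalised_max_logit_uncertainty(b))
```

Because the normalised score depends only on the shape of the logit vector, a training-time effect that rescales the max logit differently for different samples has less influence on the resulting ranking. Whether this matches the paper's exact mechanism should be checked against its equations.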

Abstract

Label smoothing (LS) is a popular regularisation method for training neural networks as it is effective in improving test accuracy and is simple to implement. "Hard" one-hot labels are "smoothed" by uniformly distributing probability mass to other classes, reducing overfitting. Prior work has suggested that in some cases LS can degrade selective classification (SC), where the aim is to reject misclassifications using a model's uncertainty. In this work, we first demonstrate empirically across an extended range of large-scale tasks and architectures that LS consistently degrades SC. We then address a gap in existing knowledge, providing an explanation for this behaviour by analysing logit-level gradients: LS degrades the uncertainty rank ordering of correct vs incorrect predictions by suppressing the max logit more when a prediction is likely to be correct, and less when it is likely to be wrong. This elucidates previously reported experimental results where strong classifiers underperform in SC. We then demonstrate the empirical effectiveness of post-hoc logit normalisation for recovering lost SC performance caused by LS. Furthermore, linking back to our gradient analysis, we again provide an explanation for why such normalisation is effective.
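The logit-level argument can be illustrated with a small sketch. It assumes the common formulation of LS, in which the target is `(1 - alpha) * one_hot + alpha / K` for `K` classes, and uses the standard identity that the cross-entropy gradient with respect to the logits is `softmax - target`. The function names are hypothetical, and the paper's own equations may be written differently. Under these assumptions, the difference between the LS and CE gradients is `alpha * (one_hot - 1/K)`, which is independent of the logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smooth_label(true_class, num_classes, alpha):
    """Assumed LS target: (1 - alpha) * one_hot + alpha / K."""
    uniform = alpha / num_classes
    return [
        (1.0 - alpha) * (1.0 if k == true_class else 0.0) + uniform
        for k in range(num_classes)
    ]

def ce_logit_gradient(logits, target):
    """Gradient of cross entropy w.r.t. the logits: softmax - target."""
    return [p - t for p, t in zip(softmax(logits), target)]

def suppression_gradient(logits, true_class, alpha):
    """LS gradient minus CE gradient, per logit."""
    k = len(logits)
    g_ce = ce_logit_gradient(logits, smooth_label(true_class, k, 0.0))
    g_ls = ce_logit_gradient(logits, smooth_label(true_class, k, alpha))
    return [ls - ce for ls, ce in zip(g_ls, g_ce)]

logits = [4.0, 1.0, 0.0]  # class 0 holds the max logit
# Label matches the prediction: positive extra gradient on the max logit,
# so gradient descent pushes it down by an additional alpha * (1 - 1/K).
print(suppression_gradient(logits, true_class=0, alpha=0.1)[0])
# Label differs from the prediction: negative extra gradient on the max logit,
# so gradient descent nudges it up by alpha / K instead.
print(suppression_gradient(logits, true_class=1, alpha=0.1)[0])
```

For a single training sample, the extra term pushes the max logit down only when it belongs to the labelled class. This is consistent with the abstract's statement that suppression is stronger when a prediction is likely to be correct.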

Paper Structure (45 sections, 26 equations, 25 figures, 2 tables)

Figures (25)

  • Figure 1: Top: LS causes overconfidence for semantic segmentation. The LS-trained model predicts much lower (ranked) uncertainty on incorrect ✗ segmentations than CE. In particular, for the erroneous region on the left where the model has predicted parts of the "sidewalk" as "road", the LS model is highly overconfident. This could have dire consequences in a safety-critical application such as autonomous driving. Bottom: LS leads to close to 0% of samples being accepted (coverage) when a strict tolerance of 1% error on accepted samples (risk) is imposed on ImageNet. Deployment-time logit normalisation effectively negates the degradation caused by LS.
  • Figure 2: Left: illustration of how label smoothing (LS) alters a training label. LS reduces data supervision and adds regularisation, potentially improving generalisation by reducing overfitting. Right: illustration of selective classification (SC). Uncertain samples ($U>\tau$) are rejected/detected, to reduce the number of errors ✗ served by the system. Rejected samples can be discarded or processed separately (e.g. deferred to a human expert). We wish to better separate/rank ✓ vs ✗ via $U$.
  • Figure 3: Risk-coverage plots for different levels of LS $\alpha$ for different models and tasks (ImageNet classification and Cityscapes semantic segmentation). Although it may improve error rate/accuracy at 100% coverage, label smoothing consistently degrades SC performance.
  • Figure 4: How the suppression gradient (\ref{eq:reg_grad}), i.e. the difference between LS and CE gradients, affects the logits. LS affects the max logit differently depending on how well fit the model is for a given sample $\boldsymbol x$. In the left two panels, where $U$ is lower (sharper softmax), the suppression on the max logit is lower when the model is poorly fit and likely to be wrong. In the right two panels, where $U$ is higher (flatter softmax), the suppression is higher when the model is well fit and more likely to be correct. Thus, LS degrades the softmax's ability to separate ✓ vs ✗, hurting SC.
  • Figure 5: Distribution of the max logit $v_\text{max}$ given the MSP $\pi_\text{max}$ for correct ✓ and incorrect ✗ predictions on evaluation data. $v_\text{max}$ is lower for ✓ for the LS model, whilst the distributions are roughly similar for CE. This empirically matches the imbalanced max logit suppression described in \ref{eq:vmax_reg}. We calculate the mean $\pm$ std. in a 0.05-wide sliding window.
  • ...and 20 more figures
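The risk-coverage quantities used in the captions above can be sketched as follows. This assumes the usual definitions: samples are accepted in order of increasing uncertainty $U$, coverage is the fraction accepted, and risk is the error rate among accepted samples. The function names and toy data are hypothetical, and ties in $U$ are not handled specially.

```python
def risk_coverage(uncertainties, correct):
    """Return (coverage, risk) points as the acceptance threshold is raised."""
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    points = []
    errors = 0
    for n_accepted, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((n_accepted / len(order), errors / n_accepted))
    return points

def coverage_at_risk(points, max_risk):
    """Largest coverage whose risk stays within max_risk (0.0 if none does)."""
    feasible = [cov for cov, risk in points if risk <= max_risk]
    return max(feasible) if feasible else 0.0

correct = [True, True, False, True]

# Good ranking: the single error has a high uncertainty.
good = risk_coverage([0.1, 0.2, 0.3, 0.4], correct)
print(coverage_at_risk(good, max_risk=0.01))

# Poor ranking: the error is the most confident prediction, so no
# threshold yields an accepted set within the 1% risk tolerance.
poor = risk_coverage([0.3, 0.2, 0.1, 0.4], correct)
print(coverage_at_risk(poor, max_risk=0.01))
```

Both rankings have the same accuracy at 100% coverage. Only the ordering of correct and incorrect predictions by $U$ differs, and that ordering is the property selective classification depends on.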

Theorems & Definitions (1)

  • proof