When Does Label Smoothing Help?

Rafael Müller, Simon Kornblith, Geoffrey Hinton

TL;DR

Label smoothing improves generalization and model calibration by shaping penultimate-layer representations into tight, equidistant class clusters, an effect the paper visualizes with a novel projection method. This boosts accuracy and calibration across vision and translation tasks, but it erases the fine-grained information in the logits about similarities between classes, which makes knowledge distillation from a label-smoothed teacher less effective. The work also shows that calibration, measured by expected calibration error (ECE), can be improved without post-hoc temperature scaling, and that translation quality (BLEU) can improve even when negative log-likelihood (NLL) gets worse. Overall, the paper clarifies when label smoothing is beneficial and highlights a trade-off between calibration and generalization on one side and preserving information for distillation on the other.

Abstract

The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam-search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.
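The soft targets described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming the usual formulation in which the hard (one-hot) target is mixed with the uniform distribution over K classes using a smoothing weight alpha; the function name `smooth_labels` and the default `alpha=0.1` are illustrative choices, not taken from this page.

```python
def smooth_labels(label, num_classes, alpha=0.1):
    """Return soft targets: (1 - alpha) * one_hot(label) + alpha / num_classes.

    `label` is the integer index of the true class. The name and the
    default alpha are assumptions made for this sketch.
    """
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")

    # Every class receives an equal share of the smoothing mass.
    uniform_share = alpha / num_classes
    targets = [uniform_share] * num_classes

    # The true class additionally keeps the remaining (1 - alpha) mass.
    targets[label] += 1.0 - alpha
    return targets


if __name__ == "__main__":
    # Four classes, true class index 2.
    print(smooth_labels(label=2, num_classes=4, alpha=0.1))
```

With `alpha=0` the function returns the ordinary one-hot target, so the same training code can switch smoothing on and off through a single parameter. The targets always sum to 1, which keeps them a valid probability distribution for a cross-entropy loss.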

Paper Structure

This paper contains 17 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Visualization of penultimate-layer activations of AlexNet/CIFAR-10 (first row), ResNet-56/CIFAR-100 (second row), and Inception-v4/ImageNet with three semantically different classes (third row) and with two semantically similar classes plus a third one (fourth row).
  • Figure 2: Reliability diagram of ResNet-56/CIFAR-100 (left) and Inception-v4/ImageNet (right).
  • Figure 3: Reliability diagram of Transformer trained on EN-DE dataset.
  • Figure 4: Effect of calibration of Transformer upon BLEU score (blue lines) and NLL (red lines). Curves without markers reflect networks trained without label smoothing while curves with markers represent networks with label smoothing.
  • Figure 5: Performance of distillation from ResNet-56 to AlexNet on CIFAR-10.
  • ...and 2 more figures
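
The reliability diagrams in Figures 2 and 3 and the ECE metric mentioned in the TL;DR rest on the same binning idea: group predictions by confidence, then compare each bin's average confidence with its accuracy. The sketch below assumes the common equal-width-bin definition of ECE; the function name, the default of 15 bins, and the half-open bin boundaries are assumptions for illustration and may differ from the paper's exact setup.

```python
def expected_calibration_error(confidences, correct, num_bins=15):
    """Equal-width-bin ECE: sum over bins of (bin size / n) * |accuracy - confidence|.

    `confidences` holds the predicted probability of the chosen class for
    each example; `correct` holds booleans saying whether that prediction
    was right. Bin count and boundaries are assumptions of this sketch.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin [i / num_bins, (i + 1) / num_bins);
    # a confidence of exactly 1.0 goes into the last bin.
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, hit in members if hit) / len(members)
        # Weight each bin's calibration gap by its share of the data.
        ece += (len(members) / n) * abs(accuracy - avg_confidence)
    return ece


if __name__ == "__main__":
    # An over-confident toy model: 90% confidence but only 75% accuracy.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False]))
```

A reliability diagram plots the per-bin accuracy against the per-bin confidence computed in the loop above. A perfectly calibrated model lies on the diagonal and has an ECE of zero.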