Rethinking Soft Labels for Knowledge Distillation: A Bias-Variance Tradeoff Perspective

Helong Zhou, Liangchen Song, Jiajie Chen, Ye Zhou, Guoli Wang, Junsong Yuan, Qian Zhang

TL;DR

The paper investigates how soft labels in knowledge distillation (KD) influence bias and variance, and finds that the bias-variance tradeoff varies from sample to sample during training. It decomposes the KD loss, identifies regularization samples that reduce variance at the expense of increased bias, and shows that distillation performance drops as their number grows, yet filtering them out entirely also hurts performance. To address this, it introduces weighted soft labels that adaptively down-weight regularization samples based on teacher and student predictions, achieving state-of-the-art results on CIFAR-100 and ImageNet. The approach keeps the regularization benefit of soft labels while limiting their adverse effect on bias.

Abstract

Knowledge distillation is an effective approach for leveraging a well-trained network, or an ensemble of them, called the teacher, to guide the training of a student network. The outputs of the teacher network are used as soft labels to supervise the training of the new network. Recent studies \citep{muller2019does,yuan2020revisiting} revealed an intriguing property of soft labels: making labels soft serves as a good regularizer for the student network. From the perspective of statistical learning, regularization aims to reduce variance; however, how bias and variance change when training with soft labels is not clear. In this paper, we investigate the bias-variance tradeoff brought by distillation with soft labels. Specifically, we observe that during training the bias-variance tradeoff varies from sample to sample. Further, under the same distillation temperature setting, we observe that distillation performance is negatively associated with the number of certain samples, which we call regularization samples because they increase bias and decrease variance. Nevertheless, we empirically find that completely filtering out regularization samples also degrades distillation performance. These findings inspired us to propose weighted soft labels, which help the network adaptively handle the sample-wise bias-variance tradeoff. Experiments on standard evaluation benchmarks validate the effectiveness of our method. Our code is available at \url{https://github.com/bellymonster/Weighted-Soft-Label-Distillation}.
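
To make the idea of sample-wise weighted soft labels concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It combines the usual hard-label cross-entropy with a temperature-softened KL term whose contribution is scaled per sample. The abstract does not state the weighting rule, so the form used here, 1 - exp(-CE_student / CE_teacher), is an assumption for illustration, as are the function names and the `temperature` and `alpha` parameters.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_z for z in scaled]


def cross_entropy(logits, label):
    """Hard-label cross-entropy for one sample (temperature 1)."""
    return -log_softmax(logits)[label]


def soft_label_kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened outputs, scaled by T^2."""
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return temperature ** 2 * kl


def sample_weight(student_logits, teacher_logits, label):
    """ASSUMED weighting rule (not given in the abstract).

    The weight lies in [0, 1) and shrinks when the student already fits
    the true label well relative to the teacher.
    """
    ce_s = cross_entropy(student_logits, label)
    ce_t = cross_entropy(teacher_logits, label)
    return 1.0 - math.exp(-ce_s / max(ce_t, 1e-12))


def weighted_kd_loss(student_logits, teacher_logits, label,
                     temperature=4.0, alpha=1.0):
    """Hard-label loss plus the per-sample weighted soft-label loss."""
    ce = cross_entropy(student_logits, label)
    w = sample_weight(student_logits, teacher_logits, label)
    kd = soft_label_kd_loss(student_logits, teacher_logits, temperature)
    return ce + alpha * w * kd
```

Under this assumed rule, a sample on which the student is already confident and correct receives a weight near 0, so its soft-label term is largely suppressed, while a sample the student gets wrong keeps a weight near 1. In a real training loop the weight would typically be treated as a constant with respect to gradients; that detail is outside this sketch.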

Paper Structure

This paper contains 26 sections, 5 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Bias and variance.
  • Figure 2: The number of regularization samples with respect to training epochs. The distillation settings are the same as the settings in Tab. \ref{tab:label-smoothing}.
  • Figure 3: Computational graph of knowledge distillation with our proposed weighted soft labels.
  • Figure 4: Visualization of the resemblances introduced by soft label regularizers: (a) VGG-19 (Teacher) $\rightarrow$ VGG-16 (Student), (b) ResNet-50 (Teacher) $\rightarrow$ ResNet-18 (Student); and of the semantic similarity between label names: (c) LCH similarity \citep{pedersen2004wordnet}, (d) WUP similarity \citep{pedersen2004wordnet}. Darker areas denote larger values.