Knowledge Distillation Based on Transformed Teacher Matching

Kaixiang Zheng; En-Hui Yang

TL;DR

This paper examines the role of temperature in knowledge distillation (KD) and proposes transformed teacher matching (TTM), which drops temperature scaling on the student side and reinterprets temperature scaling as a power transform of the probability distribution. Under this view, the TTM objective contains, relative to the original KD objective, an inherent Rényi entropy term that acts as an extra regularizer, which the authors credit for TTM's better generalization. Weighted TTM (WTTM) extends the idea with a sample-adaptive weighting coefficient that emphasizes smoother teacher targets. Experiments on CIFAR-100, ImageNet, and transformer-based setups show that TTM, and especially WTTM, consistently outperform the original KD and many contemporary distillation methods, reaching state-of-the-art accuracy in several settings. The approach is simple to implement, has a computational cost comparable to KD, and comes with publicly available source code.

Abstract

As a technique to bridge logit matching and probability distribution matching, temperature scaling plays a pivotal role in knowledge distillation (KD). Conventionally, temperature scaling is applied to both the teacher's logits and the student's logits in KD. Motivated by some recent works, in this paper we instead drop temperature scaling on the student side and systematically study the resulting variant of KD, dubbed transformed teacher matching (TTM). By reinterpreting temperature scaling as a power transform of the probability distribution, we show that, in comparison with the original KD, TTM has an inherent Rényi entropy term in its objective function, which serves as an extra regularization term. Extensive experimental results demonstrate that, thanks to this inherent regularization, TTM leads to trained students with better generalization than the original KD. To further enhance the student's capability to match the teacher's power-transformed probability distribution, we introduce a sample-adaptive weighting coefficient into TTM, yielding a novel distillation approach dubbed weighted TTM (WTTM). Comprehensive experiments show that although WTTM is simple, it is effective, improves upon TTM, and achieves state-of-the-art accuracy. Our source code is available at https://github.com/zkxufo/TTM.
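The two claims in the abstract (temperature scaling is a power transform, and dropping it on the student side introduces a Rényi entropy term) can be checked numerically. The sketch below is illustrative only and is not taken from the authors' repository: the logits are made up, the helper names are hypothetical, and the identity it tests, $H(p^t_T, q) = T\,H(p^t_T, q_T) - (T-1)\,H_{1/T}(q)$, is derived here from the standard definitions of softmax and Rényi entropy, so the paper's own statement may be arranged differently.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def power_transform(p, gamma):
    """Raise each probability to the power gamma, then renormalize."""
    powered = [pi ** gamma for pi in p]
    total = sum(powered)
    return [x / total for x in powered]

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def renyi_entropy(q, alpha):
    """Renyi entropy of order alpha (alpha != 1)."""
    return math.log(sum(qi ** alpha for qi in q)) / (1.0 - alpha)

T = 4.0
teacher_logits = [2.0, 0.5, -1.0, 0.1]   # made-up values for illustration
student_logits = [1.2, 0.3, -0.4, 0.0]   # made-up values for illustration

p_t = softmax(teacher_logits)        # teacher output, no temperature
p_t_T = softmax(teacher_logits, T)   # teacher output with temperature T
q = softmax(student_logits)          # student output, no temperature (TTM side)
q_T = softmax(student_logits, T)     # student output with temperature T (KD side)

# Claim 1: softmax(z / T) is the power transform of softmax(z) with exponent 1/T.
power_gap = max(abs(a - b) for a, b in zip(p_t_T, power_transform(p_t, 1.0 / T)))

# Claim 2: matching the transformed teacher with the untempered student equals
# T times the KD-style term minus (T - 1) times a Renyi entropy of the student.
ttm_term = cross_entropy(p_t_T, q)
kd_term = cross_entropy(p_t_T, q_T)
decomposed = T * kd_term - (T - 1.0) * renyi_entropy(q, 1.0 / T)

print(f"power-transform gap: {power_gap:.2e}")
print(f"H(p_T, q) = {ttm_term:.6f}, decomposition = {decomposed:.6f}")
```

Because the Rényi entropy enters with a negative sign, minimizing the TTM-style term rewards higher-entropy student outputs, which is consistent with the abstract's description of it as a regularizer.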

Paper Structure (25 sections, 20 equations, 4 figures, 13 tables, 1 algorithm)

Introduction
Background and Related Work
Confidence Penalty
Rényi Entropy
Label Smoothing Perspective towards KD
Statistical Perspective and Cross Entropy Upper Bound
Transformed Teacher Matching
Power Transform of Probability Distributions
From KD to TTM
Sample-adaptive Matching to the Transformed Teacher
Experiments
Experimental Settings
Main Results
Extensions
Conclusion
...and 10 more sections

Figures (4)

Figure 1: Average $H(q)$ of 3 teacher-student pairs during training. For fair comparison, we use the same temperature $T=4$ for KD, TTM and WTTM. The $\lambda$ for KD is 0.9, so the $\beta$ for TTM is 36, computed by Eq. (\ref{eq:3-8}), in order to maintain the same ratio between $H(y, q)$ and $H(p^t_T,q_T)$ as KD. As for WTTM, $\beta=36/\bar{U}$, where $\bar{U}$ is the average of $U_{\frac{1}{T}} (p^t)$ over all samples.
Figure 2: Average $D(p^t_T||q)$ of 3 teacher-student pairs during training. For each pair, the same $T$ is adopted in TTM and WTTM.
Figure 3: Entropy histograms for resnet20 trained with $\mathcal{L}_{CE}$, $\mathcal{L}_{LS}$ with $\epsilon=0.5$, $\mathcal{L}_{KD}$ with $T=1$, and $\mathcal{L}_{KD}$ with $T=4$. For fair comparison, the same $\lambda=0.9$ is adopted in both KD experiments with different temperatures.
Figure 4: (a) Various point-wise mappings. (b) Power functions with different exponents $\gamma$.
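
The value $\beta = 36$ quoted in the Figure 1 caption can be reproduced by hand. The equation the caption cites is not shown on this page, so the two objective forms below are assumptions (the common KD form with a $T^2$ factor, and a TTM form with unit weight on the label term); the third line uses the identity $H(p^t_T, q) = T\,H(p^t_T, q_T) - (T-1)\,H_{1/T}(q)$, which follows from the definitions of softmax and Rényi entropy.

$$
\begin{aligned}
\mathcal{L}_{KD} &= (1-\lambda)\,H(y, q) + \lambda T^2\, H(p^t_T, q_T) && \text{(assumed form)} \\
\mathcal{L}_{TTM} &= H(y, q) + \beta\, H(p^t_T, q) && \text{(assumed form)} \\
&= H(y, q) + \beta T\, H(p^t_T, q_T) - \beta (T-1)\, H_{1/T}(q) \\
\frac{\lambda T^2}{1-\lambda} &= \beta T
\;\;\Longrightarrow\;\;
\beta = \frac{\lambda T}{1-\lambda} = \frac{0.9 \times 4}{0.1} = 36
\end{aligned}
$$

Under these assumptions the ratio between the $H(y, q)$ and $H(p^t_T, q_T)$ terms is 1 to 144 in both objectives, which agrees with the caption's stated value for $\lambda = 0.9$ and $T = 4$.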
