Dynamic Temperature Knowledge Distillation

Yukang Wei; Yu Bai

TL;DR

DTKD introduces per-sample dynamic temperatures in knowledge distillation by minimizing the sharpness difference between teacher and student logits, where sharpness is defined as $\text{sharpness}(\mathbf{z}) = \log\sum_i e^{z_i}$. The method computes temperatures $T_{tea}$ and $T_{stu}$ from per-sample logit magnitudes to align the two output distributions, and trains with the total loss $\mathcal{L}_{KD}=\alpha\mathcal{L}_{DTKD}+\beta\mathcal{L}_{KL}+\gamma\mathcal{L}_{CE}$. Experiments on CIFAR-100 and ImageNet show competitive accuracy with added robustness in Target Class KD and Non-target Class KD settings, at minimal additional training cost. DTKD is simple to implement and compatible with existing KD variants such as DKD, offering a practical boost to knowledge transfer across varied teacher-student configurations.
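The two formulas in the TL;DR can be written out directly. The following is a minimal sketch in plain Python, not the authors' implementation: the function names `sharpness` and `total_loss`, the example logits, and the loss and weight values are all illustrative assumptions. It computes the log-sum-exp sharpness in a numerically stable way and combines three already-computed loss terms with the weights $\alpha$, $\beta$, $\gamma$.

```python
import math

def sharpness(logits):
    """Log-sum-exp of a logit vector, shifted by the max for numerical stability."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def total_loss(l_dtkd, l_kl, l_ce, alpha, beta, gamma):
    """Weighted sum alpha*L_DTKD + beta*L_KL + gamma*L_CE.

    The three loss values are assumed to be computed elsewhere; the weights
    are hyperparameters whose values are not given in this summary.
    """
    return alpha * l_dtkd + beta * l_kl + gamma * l_ce

# Hypothetical logits: one vector with a large dominant value, one nearly uniform.
peaked = [8.0, 1.0, 0.5]
flat = [1.2, 1.0, 0.9]
print(sharpness(peaked) > sharpness(flat))  # the larger-magnitude logits are sharper
```

Because log-sum-exp is dominated by the largest logit, a confident prediction with a large leading logit yields a higher sharpness than a nearly uniform one, which is what makes it usable as a per-sample signal.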

Abstract

Temperature plays a pivotal role in moderating label softness in knowledge distillation (KD). Traditional approaches often employ a static temperature throughout the KD process, which ignores both the varying difficulty of individual samples and the differing capabilities of teacher-student pairings, and so leads to suboptimal knowledge transfer. To address this, we propose Dynamic Temperature Knowledge Distillation (DTKD), which introduces dynamic, cooperative temperature control for the teacher and student models simultaneously within each training iteration. In particular, we propose "sharpness" as a metric to quantify the smoothness of a model's output distribution. By minimizing the sharpness difference between the teacher and the student, we derive sample-specific temperatures for each of them. Extensive experiments on CIFAR-100 and ImageNet-2012 demonstrate that DTKD performs comparably to leading KD techniques, with added robustness in Target Class KD and Non-target Class KD scenarios. The code is available at https://github.com/JinYu1998/DTKD.
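To make the idea of deriving cooperative temperatures from a sharpness gap concrete, here is a hedged sketch in plain Python. The closed-form rule is not stated on this page, so the rule below is an assumption made for illustration: the two temperatures sum to twice a reference temperature and are split in proportion to each model's sharpness. The names `dynamic_temperatures`, `scaled_gap`, and `ref_temp`, as well as the example logits, are hypothetical; the default of 4.0 mirrors the reference temperature mentioned in the Figure 1 caption.

```python
import math

def sharpness(logits):
    """Log-sum-exp of a logit vector, shifted by the max for numerical stability."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def dynamic_temperatures(teacher_logits, student_logits, ref_temp=4.0):
    """Split 2*ref_temp between teacher and student in proportion to sharpness.

    ASSUMPTION: this proportional rule is illustrative and not taken from
    this page. It also assumes both sharpness values are positive.
    """
    x = sharpness(teacher_logits)
    y = sharpness(student_logits)
    if x <= 0 or y <= 0:
        raise ValueError("this sketch assumes positive sharpness values")
    t_tea = 2.0 * x / (x + y) * ref_temp
    t_stu = 2.0 * y / (x + y) * ref_temp
    return t_tea, t_stu

def scaled_gap(teacher_logits, student_logits, t_tea, t_stu):
    """Absolute sharpness difference after dividing each logit vector by its temperature."""
    s_tea = sharpness([z / t_tea for z in teacher_logits])
    s_stu = sharpness([z / t_stu for z in student_logits])
    return abs(s_tea - s_stu)

# Hypothetical per-sample logits: the teacher is more confident than the student.
teacher = [9.0, 2.0, 1.0, 0.5]
student = [4.0, 2.5, 1.0, 0.5]

t_tea, t_stu = dynamic_temperatures(teacher, student)
print(t_tea > t_stu)  # the sharper teacher output gets the higher temperature
print(scaled_gap(teacher, student, t_tea, t_stu)
      < scaled_gap(teacher, student, 4.0, 4.0))  # gap shrinks versus a fixed temperature
```

Under this assumed rule, the model with the sharper output is softened more strongly, so for this example the sharpness gap after scaling is smaller than with a single shared temperature. The gap is reduced, not eliminated, because log-sum-exp is not exactly linear in the temperature.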

Paper Structure (33 sections, 2 theorems, 17 equations, 14 figures, 13 tables, 1 algorithm)

Introduction
Related Work
Methodology
Background
Sharpness as a Unified Metric
Dynamic Temperature Knowledge Distillation
Effectiveness of DTKD
Experiments
Datasets and Settings
Main Results
Training details
Extensions
Feature transferability
Training efficiency
Robustness regarding TCKD and NCKD
...and 18 more sections

Key Result

Proposition 1

Assuming $\mathbf{u}$ and $\mathbf{v}$ are vectors in $\mathbb{R}^{n}$, and $\tau_1$, $\tau_2$ are two non-zero scalar real numbers, then we can derive

Figures (14)

Figure 1: ResNet8, as the student, learns from teacher models of different sizes. The figure shows the student accuracy of the baseline model, vanilla KD, and our DTKD. Both the fixed temperature of KD and the reference temperature of DTKD are $4.0$.
Figure 2: Sharpness values from (a) the same sample with different models, and (b) the same model with various samples.
Figure 3: Teacher and student temperatures of DTKD over time.
Figure 4: Training time (per epoch) vs. accuracy on CIFAR-100. We set ResNet32$\times$4 as the teacher model and ResNet8$\times$4 as the student model. The legend shows the extra parameters required in different methods.
Figure 5: t-SNE of features learned by KD (left) and DTKD (right).
...and 9 more figures

Theorems & Definitions (3)

Proposition 1
Proposition 2
Proof
