Dynamic Temperature Knowledge Distillation

Yukang Wei, Yu Bai

TL;DR

DTKD introduces per-sample dynamic temperatures into knowledge distillation by minimizing the sharpness difference between teacher and student logits, where sharpness is defined as $\text{sharpness}(\mathbf{z}) = \log\sum_i e^{z_i}$. The method computes temperatures $T_{tea}$ and $T_{stu}$ from per-sample logit magnitudes to align the output distributions, and trains with the total loss $\mathcal{L}_{KD}=\alpha\mathcal{L}_{DTKD}+\beta\mathcal{L}_{KL}+\gamma\mathcal{L}_{CE}$. Experiments on CIFAR-100 and ImageNet show competitive accuracy with added robustness in Target Class KD and Non-target Class KD settings, at minimal additional training cost. DTKD is simple to implement and compatible with existing KD variants such as DKD, offering a practical boost to knowledge transfer across varied teacher-student configurations.
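The sharpness definition above is the log-sum-exp of the logit vector. A minimal sketch in plain Python follows; the function name `sharpness` and the two example logit vectors are illustrative assumptions, not taken from the paper or its code.

```python
import math

def sharpness(logits):
    """Log-sum-exp of a logit vector: log(sum_i exp(z_i)).

    The maximum is factored out before exponentiating so that large
    logits do not overflow.
    """
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

# Illustrative logits (assumed values, not from the paper).
peaked = [8.0, 0.5, 0.2, 0.1]   # one dominant logit
flat = [1.0, 0.9, 1.1, 1.0]     # logits of similar, small magnitude

print(sharpness(peaked), sharpness(flat))
```

For any vector of $n$ logits, this value lies between $\max_i z_i$ and $\max_i z_i + \log n$, so it is dominated by the largest logit.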

Abstract

Temperature plays a pivotal role in moderating label softness in knowledge distillation (KD). Traditional approaches often employ a static temperature throughout the KD process, which fails to address the nuanced complexities of samples with varying levels of difficulty and overlooks the distinct capabilities of different teacher-student pairings. This leads to a less-than-ideal transfer of knowledge. To improve the process of knowledge propagation, we propose Dynamic Temperature Knowledge Distillation (DTKD), which introduces dynamic, cooperative temperature control for both the teacher and the student model simultaneously within each training iteration. In particular, we propose "sharpness" as a metric to quantify the smoothness of a model's output distribution. By minimizing the sharpness difference between the teacher and the student, we can derive sample-specific temperatures for each of them. Extensive experiments on CIFAR-100 and ImageNet-2012 demonstrate that DTKD performs comparably to leading KD techniques, with added robustness in Target Class KD and Non-target Class KD scenarios. The code is available at https://github.com/JinYu1998/DTKD.
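One way sample-specific temperatures could follow from equalizing sharpness is sketched below. It rests on assumptions that are not stated in the text above: sharpness is approximated by the largest logit, the two temperatures average to a reference temperature $T$, and both largest logits are positive. The weights `alpha`, `beta`, `gamma`, the helper names, and the example logits are hypothetical, and any temperature-dependent rescaling of the KL terms is omitted because the text does not specify it.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, stabilized by subtracting the max."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q is strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def dynamic_temperatures(teacher_logits, student_logits, reference_t=4.0):
    # ASSUMPTION: sharpness is approximated by the largest logit, and the
    # two temperatures satisfy t_tea + t_stu = 2 * reference_t.
    # Requires max(teacher) + max(student) != 0 (positive maxima assumed).
    x = max(teacher_logits)
    y = max(student_logits)
    t_tea = 2.0 * x / (x + y) * reference_t
    t_stu = 2.0 * y / (x + y) * reference_t
    return t_tea, t_stu

def total_loss(teacher_logits, student_logits, label,
               alpha=1.0, beta=1.0, gamma=1.0, reference_t=4.0):
    # Hypothetical weights; mirrors alpha*L_DTKD + beta*L_KL + gamma*L_CE.
    t_tea, t_stu = dynamic_temperatures(teacher_logits, student_logits, reference_t)
    l_dtkd = kl_divergence(softmax(teacher_logits, t_tea),
                           softmax(student_logits, t_stu))
    l_kl = kl_divergence(softmax(teacher_logits, reference_t),
                         softmax(student_logits, reference_t))
    l_ce = -math.log(softmax(student_logits)[label])
    return alpha * l_dtkd + beta * l_kl + gamma * l_ce

# Illustrative logits for one sample (assumed values).
teacher = [9.0, 2.0, 1.0]
student = [3.0, 1.5, 0.5]

t_tea, t_stu = dynamic_temperatures(teacher, student)
print(t_tea, t_stu)  # → 6.0 2.0
print(total_loss(teacher, student, label=0))
```

Under these assumptions the scaled largest logits coincide (here $9/6 = 3/2 = 1.5$): the more confident teacher receives a higher temperature than the reference $T = 4$, and the student a lower one.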

Paper Structure (33 sections, 2 theorems, 17 equations, 14 figures, 13 tables, 1 algorithm)


Key Result

Proposition 1

Assuming $\mathbf{u}$ and $\mathbf{v}$ are vectors in $\mathbb{R}^{n}$, and $\tau_1$, $\tau_2$ are two non-zero scalar real numbers, then we can derive

Figures (14)

  • Figure 1: ResNet8 as the student learning from teacher models of different sizes. The figure shows the student accuracy of the baseline model, vanilla KD, and our DTKD. Both the fixed temperature of KD and the reference temperature of DTKD are $4.0$.
  • Figure 2: Sharpness values for (a) the same sample across different models, and (b) the same model across various samples.
  • Figure 3: Teacher and student temperatures of DTKD over time.
  • Figure 4: Training time (per epoch) vs. accuracy on CIFAR-100. We set ResNet32$\times$4 as the teacher model and ResNet8$\times$4 as the student model. The legend shows the extra parameters required in different methods.
  • Figure 5: t-SNE of features learned by KD (left) and DTKD (right).
  • ...and 9 more figures

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • proof