
Dynamic Temperature Scheduler for Knowledge Distillation

Sibgat Ul Islam, Jawad Ibn Ahad, Fuad Rahman, Mohammad Ruhul Amin, Nabeel Mohammed, Shafin Rahman

TL;DR

Knowledge Distillation (KD) typically relies on a fixed temperature $T$ to soften teacher outputs, but a static setting yields suboptimal gradient signals at different training stages. The authors propose Dynamic Temperature Scheduler (DTS), which combines cosine scheduling, loss-gap-based adaptive scaling, and a momentum-based temperature update to adjust $T$ during training according to the divergence between teacher and student. They show that DTS improves KD performance across vision (e.g., CIFAR-100, Tiny-ImageNet) and NLP tasks (GLUE, Dolly, SelfIns, UnNI, S-NI) without introducing extra learnable modules, and they provide code for replication. The work highlights the practical impact of adaptive temperature on gradient flow and knowledge transfer, and suggests directions for future improvement such as instance-wise temperatures or separate temperature settings for teacher and student.

Abstract

Knowledge Distillation (KD) trains a smaller student model using a large, pre-trained teacher model, with temperature as a key hyperparameter controlling the softness of output probabilities. Traditional methods use a fixed temperature throughout training, which is suboptimal. Moreover, architectural differences between teacher and student often result in mismatched logit magnitudes. We demonstrate that students benefit from softer probabilities early in training but require sharper probabilities in later stages. We introduce Dynamic Temperature Scheduler (DTS), which adjusts temperature dynamically based on the cross-entropy loss gap between teacher and student. To our knowledge, this is the first temperature scheduling method that adapts based on the divergence between teacher and student distributions. Our method integrates seamlessly with existing KD frameworks. We validate DTS across multiple KD strategies on vision (CIFAR-100, Tiny-ImageNet) and NLP tasks (GLUE, Dolly, SelfIns, UnNI, S-NI), consistently outperforming static-temperature baselines. Code is available at https://github.com/Sibgat-Ul/DTS.
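To make the role of temperature concrete, the following minimal sketch shows how dividing logits by $T$ before the softmax changes the "softness" of the resulting distribution. The function name `softmax_with_temperature` and the example logits are illustrative assumptions, not taken from the paper or its repository.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for a 3-class problem.
teacher_logits = [4.0, 1.0, 0.5]

sharp = softmax_with_temperature(teacher_logits, 1.0)  # T = 1: peaked
soft = softmax_with_temperature(teacher_logits, 4.0)   # T = 4: flatter

print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
```

A higher temperature moves probability mass from the top class to the others, which is the "softer" signal the abstract argues is useful early in training; a lower temperature keeps the distribution sharp.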

Paper Structure

This paper contains 12 sections, 16 equations, 2 figures, 7 tables, 1 algorithm.

Figures (2)

  • Figure 1: A high-level overview of our Dynamic Temperature Scheduler (DTS). The logits from the teacher and student models are used to compute cross-entropy losses against the true labels; these losses are passed to DTS along with the temperature at the current epoch. From these, a new temperature is calculated and then used for distillation.
  • Figure 2: Performance of ResNet20 distilled from ResNet56 and ResNet110 over 50 epochs of CIFAR-100 training across different temperature ranges, without modifying optimizer settings. The x-axis shows the temperature ranges.
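The pipeline described for Figure 1 can be sketched roughly as follows. This is only an illustration of how a cosine schedule, a loss-gap scaling term, and a momentum update might be combined; the class name `DynamicTemperatureSketch`, the `tanh`-based scaling, the clamping range, and all default values are assumptions made for this example and are not the authors' formulas. The reference implementation is in the linked repository.

```python
import math


class DynamicTemperatureSketch:
    """Hypothetical temperature scheduler (not the paper's exact method)."""

    def __init__(self, t_max=4.0, t_min=1.0, total_epochs=50, momentum=0.9):
        self.t_max = t_max
        self.t_min = t_min
        self.total_epochs = total_epochs
        self.momentum = momentum
        self.temperature = t_max  # start with soft targets

    def cosine_base(self, epoch):
        """Cosine decay from t_max (first epoch) to t_min (last epoch)."""
        progress = epoch / max(self.total_epochs - 1, 1)
        progress = min(max(progress, 0.0), 1.0)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return self.t_min + (self.t_max - self.t_min) * cosine

    def step(self, epoch, student_ce, teacher_ce):
        """Update the temperature from the student/teacher loss gap."""
        base = self.cosine_base(epoch)
        # Assumption: a larger gap asks for softer targets (higher T).
        gap = max(student_ce - teacher_ce, 0.0)
        scale = 1.0 + math.tanh(gap)
        target = min(max(base * scale, self.t_min), self.t_max)
        # Momentum (exponential moving average) smooths the update.
        self.temperature = (
            self.momentum * self.temperature
            + (1.0 - self.momentum) * target
        )
        return self.temperature


# Toy run: the student's cross-entropy loss shrinks toward the teacher's.
scheduler = DynamicTemperatureSketch(total_epochs=10)
temps = []
for epoch in range(10):
    student_ce = 3.0 - 0.25 * epoch  # made-up loss values
    teacher_ce = 0.5
    temps.append(scheduler.step(epoch, student_ce, teacher_ce))

print([round(t, 3) for t in temps])
```

In this toy run the temperature starts at `t_max` and drifts downward as the schedule progresses and the loss gap closes, matching the qualitative behavior the abstract describes: softer targets early, sharper targets later.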