A Unified Revisit of Temperature in Classification-Based Knowledge Distillation

Logan Frank, Jim Davis

TL;DR

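This work systematically examines how the temperature used in knowledge distillation interacts with other training components, such as the optimizer and teacher pretraining/finetuning. From these cross-connections, it identifies common situations that have a pronounced impact on temperature selection, providing guidance for practitioners employing knowledge distillation in their work.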

Abstract

A central idea of knowledge distillation is to expose relational structure embedded in the teacher's weights for the student to learn, which is often facilitated using a temperature parameter. Despite its widespread use, there remains limited understanding of how to select an appropriate temperature value, or of how this value depends on other training elements such as the optimizer and teacher pretraining/finetuning. In practice, temperature is commonly chosen via grid search or by adopting values from prior work, which can be time-consuming or may lead to suboptimal student performance when training setups differ. In this work, we posit that temperature is closely linked to these training components and present a unified study that systematically examines such interactions. From analyzing these cross-connections, we identify and present common situations that have a pronounced impact on temperature selection, providing valuable guidance for practitioners employing knowledge distillation in their work.
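
For readers unfamiliar with the temperature parameter, the following is a minimal sketch of how it is commonly applied in classification-based knowledge distillation. It is not taken from the paper: the function names, the example logits, and the T^2 loss scaling (the convention popularized by Hinton et al.) are assumptions made for illustration, and the paper's exact formulation may differ.

```python
import math


def log_softmax_with_temperature(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature).

    Hypothetical helper for illustration; not from the paper.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax probabilities."""
    return [math.exp(lp) for lp in log_softmax_with_temperature(logits, temperature)]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on temperature-scaled distributions.

    The T^2 factor is an assumed convention that keeps gradient
    magnitudes comparable across temperatures.
    """
    log_p_t = log_softmax_with_temperature(teacher_logits, temperature)
    log_p_s = log_softmax_with_temperature(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return temperature ** 2 * kl


# Illustrative logits only; higher temperature flattens the teacher
# distribution, exposing more of the relationships among non-target classes.
teacher_logits = [6.0, 2.0, 1.0, -1.0]
student_logits = [3.0, 2.5, 0.5, 0.0]
for t in (1.0, 4.0, 16.0):
    probs = softmax_with_temperature(teacher_logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3),
          round(kd_loss(student_logits, teacher_logits, t), 4))
```

In this sketch, raising the temperature increases the entropy of the teacher's output distribution, which is one way to see why the choice of temperature changes how much inter-class relationship information the student receives.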

Paper Structure

This paper contains 8 sections, 1 equation, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Accuracy of students distilled using various KD methods and temperatures.
  • Figure 2: Accuracy of students distilled with different temperatures for various durations (number of epochs) using different combinations of optimizer and batch size.
  • Figure 3: Accuracy of students at various temperatures. Color and grayscale denote evaluation at the largest number of epochs and at an earlier point (using the right y-scale), respectively.
  • Figure 4: Colored figures: average sorted temperature-scaled softmax distributions of training samples passed through the teachers (truncated to top-20 classes). Grayscale figures: label smoothing distributions with entropies similar to the softmax distributions. Note the x-axis scales reducing with increasing temperature.
  • Figure 5: Accuracy of ResNet18 students distilled using various methods that alter the amount of relationship information provided by the teacher during training.
  • ...and 2 more figures