Adaptive Temperature Based on Logits Correlation in Knowledge Distillation

Kazuhiro Matsuyama, Usman Anjum, Satoko Matsuyama, Tetsuo Shoda, Justin Zhan

TL;DR

This work tackles the inefficiency of a fixed temperature in knowledge distillation by proposing a dynamic, sample-wise temperature derived from the teacher's maximum logit. Through a Taylor-series approximation of softmax within a KL-divergence framework, the authors show that the leading, lower-order terms converge to the correlation between teacher and student logits, which yields a temperature calculation that lowers computational cost and improves transfer quality. They establish a radius-of-convergence condition, analyze truncation effects, and present an efficient algorithm that obtains the adaptive temperature before distillation. Empirical results on CIFAR-100 show improved accuracy over static and several dynamic baselines, along with notable reductions in per-epoch computation, indicating practical benefits for KD, especially in resource-constrained scenarios.

Abstract

Knowledge distillation is a technique for reproducing the performance of a deep learning model in a smaller model. It uses the outputs of one model to train another model to comparable accuracy. The two models mirror the way information is passed on in human society, with one acting as the "teacher" and the other as the "student". Softmax makes the logits generated by the two models comparable by converting them into probability distributions. It passes the teacher's logits to the student in compressed form through a parameter called temperature. Tuning this parameter improves distillation performance. Although this parameter alone governs the interaction of logits, it is not clear how temperature promotes information transfer. In this paper, we propose a novel approach to calculating the temperature. Our method refers only to the maximum logit generated by the teacher model, which reduces computation time compared with state-of-the-art methods. Our method shows promising results across different student and teacher models on a standard benchmark dataset. Algorithms that use temperature can gain this improvement by plugging in our dynamic approach. Furthermore, the approximation of the distillation process converges to a correlation between the logits of the two models. This reinforces the previous argument that distillation conveys the relevance of logits. We report that this approximating algorithm yields a higher temperature than the commonly used static values in testing.
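
To make the role of temperature concrete, the following is a minimal sketch of a temperature-scaled softmax and the KL divergence between softened teacher and student distributions for a single sample. The function names, the example logits, and the temperature values are all hypothetical and chosen for illustration only; the rule the paper uses to derive a temperature from the teacher's maximum logit is not reproduced here.

```python
import math


def log_softmax_with_temperature(logits, temperature):
    """Return log-probabilities of softmax(logits / temperature), computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def softmax_with_temperature(logits, temperature):
    """Return the softened probability distribution for one sample."""
    return [math.exp(lp) for lp in log_softmax_with_temperature(logits, temperature)]


def kd_kl_divergence(teacher_logits, student_logits, temperature):
    """KL(teacher || student) between the two temperature-softened distributions."""
    log_p = log_softmax_with_temperature(teacher_logits, temperature)
    log_q = log_softmax_with_temperature(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical logits for one sample (not taken from the paper).
teacher = [6.0, 2.0, 1.0, -1.0]
student = [3.0, 2.5, 0.5, -0.5]

# Hypothetical temperatures; a sample-wise scheme would pick one value per sample.
for temperature in (1.0, 4.0, 8.0):
    top_prob = max(softmax_with_temperature(teacher, temperature))
    loss = kd_kl_divergence(teacher, student, temperature)
    print(f"T={temperature}: teacher top probability={top_prob:.4f}, KL={loss:.4f}")
```

A higher temperature flattens the teacher distribution, so the probability assigned to the top class falls and the remaining classes carry more of the signal passed to the student. In a sample-wise scheme, `temperature` would simply be supplied per sample instead of as one global constant.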

Paper Structure

This paper contains 12 sections, 12 equations, 2 figures, 4 tables, 3 algorithms.

Figures (2)

  • Figure 1: Training loss and correlation per epoch. One is subtracted from the correlation curve for comparison. The teacher and student are vgg13 and vgg8, respectively. Note that the correlation of logits on a z-score standardized dataset is equivalent to cosine similarity.
  • Figure 2: Cumulative computational time per epoch for different architectures. Our method determines a temperature faster than the other methods. Performance is evaluated on an NVIDIA V100 GPU.
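
The equivalence noted in the Figure 1 caption can be checked numerically. The sketch below, which uses hypothetical logit vectors and helper names that do not come from the paper, compares the Pearson correlation of two raw logit vectors with the cosine similarity of their z-score standardized versions.

```python
import math


def zscore(values):
    """Standardize a vector to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical teacher and student logits for one sample.
teacher_logits = [6.0, 2.0, 1.0, -1.0]
student_logits = [3.0, 2.5, 0.5, -0.5]

correlation = pearson_correlation(teacher_logits, student_logits)
cosine = cosine_similarity(zscore(teacher_logits), zscore(student_logits))
print(f"Pearson correlation of raw logits:   {correlation:.6f}")
print(f"Cosine similarity of z-scored logits: {cosine:.6f}")
```

The two values agree up to floating-point error because z-scoring removes the mean and rescales each vector, which is exactly the centering and normalization that the Pearson coefficient applies.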