Maximizing Discrimination Capability of Knowledge Distillation with Energy Function

Seonghak Kim, Gyeongdo Ham, Suin Lee, Donggon Jang, Daeshik Kim

TL;DR

The paper addresses a limitation of constant-temperature knowledge distillation by adapting the temperature per sample. It computes an energy score $\mathcal{E}^{(i)}_\mathcal{T}$ from the teacher for each sample and partitions the samples into low-energy (certain) and high-energy (uncertain) groups. Energy KD then applies a higher temperature to low-energy samples and a lower temperature to high-energy samples, which improves knowledge transfer. The paper also proposes High Energy-based Data Augmentation (HE-DA), which augments only the high-energy subset to reduce computational cost while improving performance. Experiments report improvements on CIFAR-100, TinyImageNet, ImageNet, and CIFAR-100-LT, with gains over prior logit-based and some feature-based KD methods, which makes the approach relevant to resource-constrained deployment.
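
The per-sample temperature idea can be illustrated with a minimal sketch. It assumes the energy score is the standard free energy of the teacher logits, $-T \log \sum_j \exp(z_j / T)$, and that groups are formed by ranking samples by energy. The function names, the base temperature `base_t`, the offset `delta`, and the group `fraction` are hypothetical values chosen for illustration, not settings taken from the paper.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free energy -T * log(sum(exp(z / T))), using a stable log-sum-exp."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def softmax(logits, temperature):
    """Temperature-scaled softmax over one sample's logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def assign_temperatures(energies, base_t=4.0, delta=1.0, fraction=0.25):
    """Assumed rule: the lowest-energy `fraction` of samples gets a higher
    temperature, the highest-energy `fraction` gets a lower one, and the
    rest keep the base temperature."""
    n = len(energies)
    k = min(int(n * fraction), n // 2)  # keep the two groups disjoint
    order = sorted(range(n), key=lambda i: energies[i])
    temps = [base_t] * n
    for i in order[:k]:          # low energy: certain predictions
        temps[i] = base_t + delta
    for i in order[n - k:]:      # high energy: uncertain predictions
        temps[i] = base_t - delta
    return temps


# Hypothetical teacher logits for four samples and three classes.
teacher_logits = [
    [10.0, 0.0, 0.0],   # confident prediction
    [1.0, 0.9, 1.1],
    [5.0, 1.0, 0.0],
    [0.2, 0.1, 0.0],    # near-uniform prediction
]
energies = [energy_score(z) for z in teacher_logits]
temps = assign_temperatures(energies)
soft_targets = [softmax(z, t) for z, t in zip(teacher_logits, temps)]
```

In this toy batch the confident first sample has the lowest energy and receives the higher temperature, while the near-uniform last sample has the highest energy and receives the lower one.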

Abstract

Knowledge distillation methods (KDs) are essential for applying the latest computer vision techniques, which require a large computational cost, in real industrial applications. Existing logit-based KDs apply a constant temperature to all samples in the dataset, which limits how well the knowledge inherent in each individual sample is used. In our approach, we classify the dataset into two categories (i.e., low energy and high energy samples) based on their energy scores. Through experiments, we have confirmed that low energy samples exhibit high confidence scores, indicating certain predictions, while high energy samples yield low confidence scores, indicating uncertain predictions. To distill optimal knowledge by adjusting non-target class predictions, we apply a higher temperature to low energy samples to create smoother distributions and a lower temperature to high energy samples to achieve sharper distributions. Compared to previous logit-based and feature-based methods, our energy-based KD (Energy KD) achieves better performance on various datasets. In particular, Energy KD shows significant improvements on the CIFAR-100-LT and ImageNet datasets, which contain many challenging samples. Furthermore, we propose high energy-based data augmentation (HE-DA) to further improve performance. We demonstrate that a larger performance improvement can be achieved by augmenting only a portion of the dataset rather than the entire dataset, suggesting that it can be employed on resource-limited devices. To the best of our knowledge, this paper represents the first attempt to use an energy function in knowledge distillation and data augmentation, and we believe it will greatly contribute to future research.
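
The HE-DA idea of augmenting only a portion of the dataset can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names, the `fraction` value, and the choice to append augmented copies (rather than replace the originals) are all hypothetical, and `augment` stands in for whatever augmentation is actually used.

```python
def select_high_energy(energies, fraction=0.5):
    """Return the indices of the highest-energy `fraction` of samples."""
    n = len(energies)
    k = int(n * fraction)
    order = sorted(range(n), key=lambda i: energies[i], reverse=True)
    return sorted(order[:k])


def augment_high_energy(samples, energies, augment, fraction=0.5):
    """Assumption: keep every original sample and append augmented copies
    of the high-energy ones only, so the augmentation cost scales with
    `fraction` rather than with the full dataset size."""
    chosen = select_high_energy(energies, fraction)
    return list(samples) + [augment(samples[i]) for i in chosen]


# Toy example: strings stand in for images, upper-casing for augmentation.
toy_samples = ["a", "b", "c", "d"]
toy_energies = [-10.0, -2.1, -5.0, -1.2]
augmented = augment_high_energy(toy_samples, toy_energies, str.upper)
```

Only the two highest-energy samples (`"b"` and `"d"`) pass through the augmentation function here, which is the source of the cost saving relative to augmenting every sample.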

Paper Structure (19 sections, 13 equations, 7 figures, 9 tables)

This paper contains 19 sections, 13 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Schematic diagram of conventional knowledge distillation and our method: (a) constant temperature scaling, (b) different temperature scaling. Our method receives the energy score of each sample from the blue dashed line.
  • Figure 2: ImageNet samples categorized according to their energy scores obtained from ResNet32x4. The red boxes mark certain images, which have low energy scores and accurately represent their assigned labels. The green boxes mark uncertain images, which have high energy scores and do not clearly reflect their assigned labels.
  • Figure 3: Average predictions for particular classes with low energy (blue line) and high energy (red line) samples. Low energy samples exhibit high confidence scores and lack substantial dark knowledge, whereas high energy samples display low confidence scores and carry an excessive amount of dark knowledge.
  • Figure 4: Energy distribution across the entire dataset. This illustrative example assumes that there are 10 image samples and sets the percentage of the total samples to 40%.
  • Figure 5: Performance variations according to the sample types: low, high, and mixed energy. (a): VGG13/MobileNetV2, (b): ResNet32x4/ShuffleNetV2
  • ...and 2 more figures