Balance Divergence for Knowledge Distillation

Yafei Qi; Chen Wang; Zhaoning Zhang; Yaping Liu; Yongmin Zhang

TL;DR

This work tackles the imbalance in knowledge transfer during logit-based KD caused by neglecting tiny negative probabilities in teacher outputs. It proposes Balance Divergence Distillation (BDD), which combines forward KL with reverse KL and uses temperature-based balancing to jointly emphasize positive and negative regions of the teacher distribution. The method demonstrates consistent gains on image classification (CIFAR-100, ImageNet) and semantic segmentation (Cityscapes), including notable mIoU improvements, and proves easy to integrate with existing KD approaches. By addressing dark knowledge in a principled way, BDD offers a simple, effective baseline for improving the distillation of knowledge to lightweight student models across vision tasks.
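As a rough illustration of why the two directions of the divergence emphasize different regions, the standard definitions are written out below. The symbols p (teacher distribution), q (student distribution), and C (number of classes) are notation assumed here for illustration and may not match the paper's own.

```latex
% p: teacher distribution, q: student distribution, C: number of classes
% (assumed notation, not taken from the paper)
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i=1}^{C} p_i \log \frac{p_i}{q_i}
\qquad \text{(forward KL: each term is weighted by the teacher's } p_i \text{)}

D_{\mathrm{KL}}(q \,\|\, p) = \sum_{i=1}^{C} q_i \log \frac{q_i}{p_i}
\qquad \text{(reverse KL: each term is weighted by the student's } q_i \text{)}
```

In the forward form, a class with a tiny teacher probability p_i contributes almost nothing to the sum, so a mismatch there is barely penalized. In the reverse form, any student mass q_i placed on such a class is multiplied by a large log(q_i / p_i), which is one way to read the compensatory role the summary attributes to reverse KL.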

Abstract

Knowledge distillation has been widely adopted in computer vision tasks, since it can effectively enhance the performance of lightweight student networks by leveraging the knowledge transferred from cumbersome teacher networks. Most existing knowledge distillation methods use Kullback-Leibler divergence to make the student network mimic the teacher network's logit output probabilities. Nonetheless, these methods may neglect the negative parts of the teacher's "dark knowledge", because the divergence calculation may ignore the effect of the minute probabilities in the teacher's logit output. This deficiency may lead to suboptimal logit mimicry during distillation and to an imbalance in the information acquired by the student network. In this paper, we investigate the impact of this imbalance and propose a novel method, named Balance Divergence Distillation. By introducing a compensatory operation using reverse Kullback-Leibler divergence, our method improves the modeling of the extremely small values in the negative region of the teacher's output while preserving the learning capacity for the positive region. Furthermore, we test the impact of adjusting the temperature coefficients, which can further balance the knowledge transfer. We evaluate the proposed method on several computer vision tasks, including image classification and semantic segmentation. The evaluation results show that our method achieves an accuracy improvement of 1% to 3% for lightweight students on both the CIFAR-100 and ImageNet datasets, and a 4.55% improvement in mIoU for PSP-ResNet18 on the Cityscapes dataset. The experiments show that our method is a simple yet highly effective solution that can be smoothly applied to different knowledge distillation methods.
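The idea of pairing forward and reverse KL with separate temperatures can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the weights `alpha` and `beta`, the default temperatures, and the T^2 scaling (a common convention in logit distillation) are all hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def balanced_divergence(teacher_logits, student_logits,
                        t_forward=4.0, t_reverse=2.0,
                        alpha=1.0, beta=1.0):
    """Weighted sum of forward and reverse KL (illustrative only).

    All default values are placeholders, not values from the paper.
    """
    # Forward KL(teacher || student): terms are weighted by teacher probabilities.
    p_f = softmax(teacher_logits, t_forward)
    q_f = softmax(student_logits, t_forward)
    forward = kl(p_f, q_f)

    # Reverse KL(student || teacher): penalizes student mass where the teacher is tiny.
    p_r = softmax(teacher_logits, t_reverse)
    q_r = softmax(student_logits, t_reverse)
    reverse = kl(q_r, p_r)

    # T^2 scaling is an assumed convention to keep the two terms comparable.
    return alpha * t_forward ** 2 * forward + beta * t_reverse ** 2 * reverse


teacher = [5.0, 1.0, -2.0]
student = [2.0, 2.5, 0.0]
print(balanced_divergence(teacher, student))
```

Using two temperatures lets each term be softened independently, which mirrors the temperature-based balancing described above; how the paper actually sets or combines them is not specified in this summary.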

Paper Structure (16 sections, 11 equations, 5 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Logit-based Knowledge Distillation
Feature-based and Relation-based Knowledge Distillation
Proposed Method
Preliminary
Balance Divergence Distillation
Balance of Temperature Between Forward and Reverse
Effects of BDD on Dense Prediction
Overall loss
Experiments
Experimental Setup
Experiments on Classification
Experiments on Dense Prediction
Ablation Study
...and 1 more sections

Figures (5)

Figure 1: Comparison of output probability fitting by normal KD and BDD. (a) Normal KD, where the student cannot learn the minuscule values well. (b) Our proposed Balance Divergence Distillation (BDD), which highlights the positive and negative regions of the teacher's probability outputs through temperature coefficient scaling and then uses different KL divergences to mimic the model's outputs.
Figure 2: Comparison of output probability fitting by normal KD and BDD. This image illustrates the logit output fitting by KD and BDD on CIFAR-100. The blue histogram shows the student's logit output, while the orange histogram represents the teacher's logit output. The left chart reveals that standard KD tends to overfit the positives of the teacher, neglecting the negatives and resulting in suboptimal learning outcomes. In contrast, the right chart demonstrates that BDD adeptly fits the teacher's negative regions. To make the minimal values easier to see, a temperature of 4.0 is applied to smooth the outputs, and the highest value is clipped.
Figure 3: Imbalance in the output feature map of semantic segmentation. The image shows output feature maps and channel attention probability maps for a semantic segmentation task on the Cityscapes dataset. The red boxes at the top and bottom of the image mark the selected regions' output features and their corresponding channel attention probability distributions. The vertical axis of the channel attention indicates the probability values, while the horizontal axis represents the flattened pixel values.
Figure 4: Qualitative segmentation results and channel distributions. This figure displays the output feature maps and channel attention probability maps of PSPNet-R18 for a semantic segmentation task on the Cityscapes dataset. (a) Raw images, (b) ground truth (GT), (c) our method (BDD), (d) channel attention probability of our method (BDD$^*$), (e) channel-wise distillation (CWD), (f) channel attention probability of CWD (CWD$^*$). The selected regions indicate the segmentation quality of the different knowledge distillation methods.
Figure 5: Top-1 accuracy on the training and validation sets during training. The red line represents the accuracy of BDD, the blue line represents the accuracy of DKD, and the yellow line represents the accuracy of KD.
