Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning

Seonghak Kim; Gyeongdo Ham; Yucheol Cho; Daeshik Kim

TL;DR

This work tackles the limitations of KL-based knowledge distillation, which can misallocate target and dark knowledge depending on teacher entropy. It introduces Robustness-Reinforced Knowledge Distillation (R2KD), which replaces KL with a correlation-distance loss that combines value-based and rank-based similarities between teacher and student predictions, and augments this with a pruned teacher to boost robustness to hard and augmented samples. The method is validated across CIFAR-100, ImageNet, FGVR, and TinyImageNet, showing consistent gains over state-of-the-art KD methods, with particular resilience when data augmentation is used. The approach offers practical benefits for deploying compact models on resource-constrained devices by enabling robust knowledge transfer and better handling of challenging inputs without additional training of the pruned teacher.
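The combination of value-based and rank-based similarity described above can be illustrated with a small NumPy sketch. It assumes the value-based term is one minus the Pearson correlation and the rank-based term is one minus the Spearman correlation. The function names, the weights `alpha` and `beta`, and the example vectors are hypothetical and are not taken from the paper. The hard ranking used here is not differentiable, so this only illustrates the quantities being compared and is not a training loss.

```python
import numpy as np

def pearson_distance(p_t, p_s):
    """Value-based term: 1 - Pearson correlation between two vectors."""
    t = p_t - p_t.mean()
    s = p_s - p_s.mean()
    rho = (t @ s) / (np.linalg.norm(t) * np.linalg.norm(s) + 1e-12)
    return 1.0 - rho

def ranks(v):
    """Rank of each entry (0 = smallest); assumes there are no ties."""
    order = np.argsort(v)
    r = np.empty(len(v), dtype=float)
    r[order] = np.arange(len(v))
    return r

def spearman_distance(p_t, p_s):
    """Rank-based term: 1 - Spearman correlation (Pearson on the ranks)."""
    return pearson_distance(ranks(p_t), ranks(p_s))

def correlation_distance(p_t, p_s, alpha=1.0, beta=1.0):
    """Weighted sum of both terms; alpha and beta are illustrative weights."""
    return alpha * pearson_distance(p_t, p_s) + beta * spearman_distance(p_t, p_s)

# Hypothetical 4-class predictions.
teacher = np.array([0.70, 0.20, 0.07, 0.03])
student_same_order = np.array([0.55, 0.30, 0.10, 0.05])  # same class ranking
student_swapped = np.array([0.55, 0.05, 0.10, 0.30])     # non-target ranking shuffled

d_same = correlation_distance(teacher, student_same_order)
d_swapped = correlation_distance(teacher, student_swapped)
```

In this toy case both students agree with the teacher on the top class, but `student_swapped` orders the non-target classes differently, so its rank-based term and its overall distance are larger.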

Abstract

Knowledge distillation (KD) improves the performance of efficient, lightweight models (the student) by transferring knowledge from more complex models (the teacher). However, most existing KD techniques rely on Kullback-Leibler (KL) divergence, which has two limitations. First, when the teacher distribution has high entropy, the mode-averaging nature of KL divergence hinders the transfer of sufficient target information. Second, when the teacher distribution has low entropy, KL divergence tends to focus excessively on specific modes and fails to convey an abundant amount of valuable knowledge to the student. Consequently, on datasets that contain numerous confounding or challenging samples, student models may struggle to acquire sufficient knowledge, resulting in subpar performance. Furthermore, we observed that in previous KD approaches, data augmentation, a technique intended to enhance a model's generalization, can have an adverse impact. We therefore propose Robustness-Reinforced Knowledge Distillation (R2KD), which leverages correlation distance and network pruning. This approach enables KD to incorporate data augmentation effectively for performance improvement. Extensive experiments on various datasets, including CIFAR-100, FGVR, TinyImageNet, and ImageNet, demonstrate our method's superiority over current state-of-the-art methods.
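The two regimes named in the abstract, high-entropy and low-entropy teacher distributions, can be made concrete with a short NumPy sketch that computes the entropy of a teacher distribution and the KL divergence used by standard KD. The logits, the helper names, and the default temperature are hypothetical choices for illustration only and are not values from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p, eps=1e-12):
    """Shannon entropy in nats."""
    return float(-(p * np.log(p + eps)).sum())

def kl_divergence(p_t, p_s, eps=1e-12):
    """KL(teacher || student), the usual KD objective on soft predictions."""
    return float((p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum())

# Hypothetical teacher logits for a 4-class problem.
p_low = softmax([8.0, 1.0, 0.5, 0.0])   # confident teacher, low entropy
p_high = softmax([2.0, 1.8, 1.7, 1.5])  # uncertain teacher, high entropy

h_low, h_high = entropy(p_low), entropy(p_high)
kl_mismatch = kl_divergence(p_high, p_low)
```

Here `h_high` exceeds `h_low`, which separates the two teacher regimes the abstract distinguishes, and `kl_mismatch` measures how far a confident student is from an uncertain teacher under the KL objective.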

Paper Structure

This paper contains 26 sections, 10 equations, 7 figures, and 10 tables.

Introduction
Related Works
Knowledge Distillation
Correlation-based Distance
Network Pruning
Proposed Method
Limitation of KL divergence
Case I, $p_i^{\mathcal{T}}=0$
Case II, $p_i^{\mathcal{T}}>0$
Correlation Distance Loss
Pruned Teacher Network
Experiment
Datasets
CIFAR-100
ImageNet
...and 11 more sections

Figures (7)

Figure 1: Comparison of distribution with (a) high entropy in teacher distribution and (b) low entropy in teacher distribution. Gray: teacher distribution. Orange: student distribution with KL divergence. Green: student distribution with our correlation distance.
Figure 2: Illustration of the proposed method. The pruned teacher model is a duplicate of the pre-trained teacher model, and the input image is passed through these two models to produce $p^{\mathrm{Pr}}$ and $p^{\mathcal{T}}$ predictions. To address uncertainty in images, these two predictions are combined into a single prediction $p^{\mathcal{T}}$, which is then used to distill knowledge into the student model. We consider both $p^{\mathcal{S}}$ and $p^{\mathcal{T}}$ as vectors, and employ value- and rank-based correlation techniques to make $p^{\mathcal{S}}$ resemble $p^{\mathcal{T}}$.
Figure 3: Understanding the correlation coefficients. Value-based correlation coefficient (denoted as $\rho_{p^{\mathcal{T}}, p^{\mathcal{S}}}$) and rank-based correlation coefficient (denoted as $r_s$) between teacher and student predictions. When only a value-based correlation is applied in KD, the student's weights are updated so that its predictions completely match the teacher's predictions (red line, marked as (3)). However, when a rank-based correlation is also applied, the student model can learn to obtain rich information from the teacher (green line, marked as (1)).
Figure 4: Comparison of entropy. Entropy for several classes that have high entropy under the teacher model. Left: ResNet32x4-ResNet8x4, Left-Middle: ResNet32x4-ShuffleNetV2, Right-Middle: VGG13-VGG8, Right: ResNet50-MobileNetV2.
Figure 5: Comparison of entropy. Prediction distributions for samples with high entropy, extracted from the CIFAR-100 test set. The teacher is ResNet32x4 and the student is ResNet8x4.
...and 2 more figures
