Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning
Seonghak Kim, Gyeongdo Ham, Yucheol Cho, Daeshik Kim
TL;DR
This work tackles the limitations of KL-based knowledge distillation, which can misallocate target and dark knowledge depending on teacher entropy. It introduces Robustness-Reinforced Knowledge Distillation (R2KD), which replaces KL with a correlation-distance loss that combines value-based and rank-based similarities between teacher and student predictions, and augments this with a pruned teacher to boost robustness to hard and augmented samples. The method is validated across CIFAR-100, ImageNet, FGVR, and TinyImageNet, showing consistent gains over state-of-the-art KD methods, with particular resilience when data augmentation is used. The approach offers practical benefits for deploying compact models on resource-constrained devices by enabling robust knowledge transfer and better handling of challenging inputs without additional training of the pruned teacher.
Abstract
Knowledge distillation (KD) improves the performance of efficient, lightweight models (the student) by transferring knowledge from more complex models (the teacher). However, most existing KD techniques rely on Kullback-Leibler (KL) divergence, which has two limitations. First, when the teacher distribution has high entropy, the mode-averaging nature of KL divergence hinders the transfer of sufficient target information. Second, when the teacher distribution has low entropy, KL divergence focuses excessively on specific modes and fails to convey enough of the teacher's valuable knowledge to the student. Consequently, on datasets that contain many confounding or challenging samples, student models may struggle to acquire sufficient knowledge, resulting in subpar performance. Furthermore, we observed that data augmentation, a technique intended to improve a model's generalization, can have an adverse impact in previous KD approaches. We therefore propose Robustness-Reinforced Knowledge Distillation (R2KD), which leverages correlation distance and network pruning. This approach enables KD to incorporate data augmentation effectively for improved performance. Extensive experiments on CIFAR-100, FGVR, TinyImageNet, and ImageNet demonstrate our method's superiority over current state-of-the-art methods.
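
To make the idea of a correlation-distance loss concrete, the following is a minimal NumPy sketch, not the authors' implementation. The TL;DR describes the loss only as a combination of value-based and rank-based similarities between teacher and student predictions; here we assume Pearson correlation for the value-based term and Spearman correlation (Pearson on ranks) for the rank-based term. The function names, the temperature, and the mixing weight `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def pearson_distance(p, q, eps=1e-12):
    """Value-based term (assumed): 1 - Pearson correlation, per sample."""
    pc = p - p.mean(axis=-1, keepdims=True)
    qc = q - q.mean(axis=-1, keepdims=True)
    num = (pc * qc).sum(axis=-1)
    den = np.linalg.norm(pc, axis=-1) * np.linalg.norm(qc, axis=-1) + eps
    return 1.0 - num / den


def ranks(x):
    """Ordinal ranks along the last axis (ties broken by position)."""
    return np.argsort(np.argsort(x, axis=-1), axis=-1).astype(float)


def spearman_distance(p, q):
    """Rank-based term (assumed): 1 - Spearman correlation, per sample."""
    return pearson_distance(ranks(p), ranks(q))


def correlation_distance_loss(student_logits, teacher_logits,
                              temperature=4.0, alpha=0.5):
    """Hypothetical mix of the two terms, averaged over the batch."""
    s = softmax(student_logits, temperature)
    t = softmax(teacher_logits, temperature)
    value_term = pearson_distance(s, t)
    rank_term = spearman_distance(s, t)
    return float(np.mean(alpha * value_term + (1.0 - alpha) * rank_term))
```

Each term lies in the range 0 to 2, with 0 meaning the student's predictions are perfectly correlated with the teacher's. Note that the hard ranks computed with `argsort` are not differentiable, so this sketch is suitable only for inspecting the quantity; training with a rank-based term would require a differentiable surrogate, which is outside what this section specifies.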
