Cross-View Consistency Regularisation for Knowledge Distillation
Weijia Zhang, Dongnan Liu, Weidong Cai, Chao Ma
TL;DR
CRLD reframes knowledge distillation as a cross-view consistency problem. It uses weak and strong augmentations to generate multiple views of each input, enforces both within-view and cross-view consistency in logit space, and applies confidence-based soft label mining, keeping only teacher signals that pass a confidence threshold, to address teacher overconfidence and confirmation bias. The approach yields state-of-the-art results on CIFAR-100, Tiny-ImageNet, and ImageNet across diverse teacher–student architectures and provides consistent gains when applied to existing KD methods, all without adding extra parameters. Empirically, CRLD demonstrates strong generalisation, robustness to label scarcity, and applicability to transformer-based backbones, making it a practical, parameter-free enhancement for logit-based KD.
Abstract
Knowledge distillation (KD) is an established paradigm for transferring privileged knowledge from a cumbersome model to a lightweight and efficient one. In recent years, logit-based KD methods have been quickly catching up in performance with their feature-based counterparts. However, previous research has pointed out that logit-based methods are still fundamentally limited by two major issues in their training process, namely an overconfident teacher and confirmation bias. Inspired by the success of cross-view learning in fields such as semi-supervised learning, in this work we introduce within-view and cross-view regularisations to standard logit-based distillation frameworks to combat both issues. We also perform confidence-based soft label mining to improve the quality of the distillation signals from the teacher, which further mitigates the confirmation bias problem. Despite its apparent simplicity, the proposed Consistency-Regularisation-based Logit Distillation (CRLD) significantly boosts student learning, setting new state-of-the-art results on the standard CIFAR-100, Tiny-ImageNet, and ImageNet datasets across a diverse range of teacher and student architectures, whilst introducing no extra network parameters. Orthogonal to ongoing logit-based distillation research, our method enjoys excellent generalisation properties and, without bells and whistles, boosts the performance of various existing approaches by considerable margins.
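To make the idea concrete, the following is a minimal NumPy sketch of how within-view and cross-view logit consistency with confidence-based masking could be computed. It is an illustration under stated assumptions, not the exact CRLD objective: the function names (`softmax`, `masked_kl`, `consistency_losses`), the threshold values, the temperature, the use of KL divergence, and the choice to threshold on un-tempered teacher confidence are all hypothetical and not specified in the text above.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax with temperature scaling (numerically stabilised)."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def masked_kl(teacher_logits, student_logits, threshold, temperature=4.0):
    """KL(teacher || student), averaged over the batch.

    Samples whose teacher confidence (max un-tempered probability) falls
    below `threshold` contribute zero. Both the confidence definition and
    the T^2 scaling are assumptions borrowed from common KD practice.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    confidence = softmax(teacher_logits).max(axis=1)
    mask = (confidence >= threshold).astype(float)
    eps = 1e-12  # avoids log(0)
    kl = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=1)
    return float((mask * kl).sum() / len(mask)) * temperature ** 2


def consistency_losses(t_weak, t_strong, s_weak, s_strong,
                       tau_weak=0.95, tau_strong=0.95, temperature=4.0):
    """Return (within_view, cross_view) losses for one batch.

    t_* are teacher logits and s_* are student logits on the weakly and
    strongly augmented views. Threshold defaults are placeholders.
    """
    # Within-view: teacher and student see the same augmented view.
    within = (masked_kl(t_weak, s_weak, tau_weak, temperature)
              + masked_kl(t_strong, s_strong, tau_strong, temperature))
    # Cross-view: teacher output on one view supervises the student on the other.
    cross = (masked_kl(t_weak, s_strong, tau_weak, temperature)
             + masked_kl(t_strong, s_weak, tau_strong, temperature))
    return within, cross
```

In a training loop, these two terms would typically be added to the usual supervised loss with weighting coefficients; the weights are a further hyperparameter that this sketch leaves to the caller. Since the masking only selects which teacher outputs contribute, the sketch adds no learnable parameters, consistent with the claim above.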
