Cross-View Consistency Regularisation for Knowledge Distillation

Weijia Zhang, Dongnan Liu, Weidong Cai, Chao Ma

TL;DR

CRLD reframes knowledge distillation as a cross-view consistency problem, introducing within-view and cross-view logit regularisations together with confidence-based soft label mining to address overconfidence and confirmation bias. It uses weak and strong augmentations to generate multiple views and enforces both within-view and cross-view consistency in logit space, selecting reliable teacher signals via thresholds. The approach yields state-of-the-art results on CIFAR-100, Tiny-ImageNet, and ImageNet across diverse teacher–student architectures and provides consistent gains when applied to existing KD methods, all without adding extra parameters. Empirically, CRLD demonstrates strong generalisation, robustness to label scarcity, and applicability to transformer-based backbones, highlighting a practical, parameter-free enhancement for logit-based KD.

Abstract

Knowledge distillation (KD) is an established paradigm for transferring privileged knowledge from a cumbersome model to a lightweight and efficient one. In recent years, logit-based KD methods have been quickly catching up in performance with their feature-based counterparts. However, previous research has pointed out that logit-based methods are still fundamentally limited by two major issues in their training process, namely an overconfident teacher and confirmation bias. Inspired by the success of cross-view learning in fields such as semi-supervised learning, in this work we introduce within-view and cross-view regularisations to standard logit-based distillation frameworks to combat these two issues. We also perform confidence-based soft label mining to improve the quality of distilling signals from the teacher, which further mitigates the confirmation bias problem. Despite its apparent simplicity, the proposed Consistency-Regularisation-based Logit Distillation (CRLD) significantly boosts student learning, setting new state-of-the-art results on the standard CIFAR-100, Tiny-ImageNet, and ImageNet datasets across a diversity of teacher and student architectures, whilst introducing no extra network parameters. Orthogonal to ongoing logit-based distillation research, our method enjoys excellent generalisation properties and, without bells and whistles, boosts the performance of various existing approaches by considerable margins.

Paper Structure

This paper contains 17 sections, 3 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: A schematic comparison of logit-based distillation methods from a cross-view learning perspective: (a) Methods mimicking logits in a unitary view (KD, TAKD, CTKD, DKD, NKD, MLLD, NormKD). (b) Methods optimising and mimicking contrastive relations (SSKD). (c) The proposed CRLD, which involves within-view and cross-view logit transfer.
  • Figure 2: The CRLD framework. An input image is transformed into a weakly-transformed view and a strongly-transformed view. Both views are fed into the teacher and the student separately, yielding four predictions of the same instance. Amongst them, two types of consistency regularisation are enforced: within-view (①②) and cross-view (③④). In addition, the student's predictions are supervised by ground-truth labels (⑤), as per standard practice.
  • Figure 3: Evolution of training (top) and test (bottom) set Top-1 accuracy (%) on CIFAR-100.
  • Figure 4: t-SNE visualisation of teacher's and distilled student's features on CIFAR-100.
  • Figure 5: Class-wise similarity maps between teacher and student predictions by NormKD and CRLD on CIFAR-100.
  • ...and 2 more figures
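
The Figure 2 caption describes four predictions per image (teacher and student, each on a weak and a strong view) tied together by within-view and cross-view consistency terms plus a ground-truth term. The following is a minimal per-sample sketch of how such an objective could be assembled, not the authors' implementation. Several details are assumptions made for illustration only: the function and parameter names are hypothetical, a teacher signal is kept only when its maximum softmax probability reaches a threshold, the consistency terms use a temperature-scaled KL divergence, the cross-entropy term is applied to the weak-view student prediction, and all default temperatures, thresholds and weights are placeholders.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def masked_kd(teacher_logits, student_logits, temperature, threshold):
    """Temperature-scaled KD term, dropped when the teacher is not confident.

    Assumption: confidence is the teacher's max softmax probability at T=1.
    """
    confidence = max(softmax(teacher_logits))
    if confidence < threshold:
        return 0.0
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


def crld_loss_sketch(t_weak, t_strong, s_weak, s_strong, label,
                     temperature=4.0, thr_weak=0.5, thr_strong=0.5,
                     w_within=1.0, w_cross=1.0):
    """Combine within-view, cross-view and ground-truth terms for one sample.

    All hyperparameter defaults here are placeholders, not values from the paper.
    """
    # Within-view: teacher and student see the same view.
    within = (masked_kd(t_weak, s_weak, temperature, thr_weak)
              + masked_kd(t_strong, s_strong, temperature, thr_strong))
    # Cross-view: the teacher's prediction on one view supervises
    # the student's prediction on the other view.
    cross = (masked_kd(t_weak, s_strong, temperature, thr_weak)
             + masked_kd(t_strong, s_weak, temperature, thr_strong))
    # Ground-truth supervision (assumed here to use the weak view).
    ce = -math.log(softmax(s_weak)[label] + 1e-12)
    total = ce + w_within * within + w_cross * cross
    return {"within": within, "cross": cross, "ce": ce, "total": total}
```

In this sketch, raising a threshold removes the corresponding teacher signal from both the within-view and the cross-view term, which is one simple way to read "selecting reliable teacher signals via thresholds"; the paper's actual mining rule may differ.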