Decoupling Dark Knowledge via Block-wise Logit Distillation for Feature-level Alignment

Chengting Yu, Fengzhao Zhang, Ruizhe Chen, Aili Wang, Zuozhu Liu, Shurun Tan, Er-Ping Li

TL;DR

Block-KD addresses the long-standing question of how to best transfer knowledge from a teacher to a student by reconciling logit-based and feature-based distillation. It introduces block-wise logit distillation with implicit feature alignment via stepping-stone models that gradually substitute teacher blocks with the student. The approach yields competitive or superior results on visual tasks and NLP, with a lightweight variant that maintains accuracy while reducing cost. This work highlights the potential of combining logits and features and provides a practical, scalable framework for more effective knowledge distillation across domains.

Abstract

Knowledge Distillation (KD), a training paradigm in which a larger teacher network guides a smaller student network, transfers dark knowledge from the teacher to the student via logits or intermediate features, with the aim of producing a well-performing lightweight model. Notably, many subsequent feature-based KD methods outperformed the earliest logit-based KD method and successively produced numerous state-of-the-art distillation methods. Nevertheless, recent work has uncovered the potential of the logit-based method, bringing the simple logit-based form of KD back into the limelight. Features or logits? The two implement KD from entirely distinct perspectives; therefore, choosing between logits and features is not straightforward. This paper provides a unified perspective of feature alignment in order to better understand their fundamental distinction. Inheriting the design philosophy and insights of feature-based and logit-based methods, we introduce a block-wise logit distillation framework that applies implicit logit-based feature alignment by gradually replacing the teacher's blocks, forming intermediate stepping-stone models that bridge the gap between the student and the teacher. Our method obtains comparable or superior results to state-of-the-art distillation methods. This paper demonstrates the great potential of combining logits and features, and we hope it will inspire future research to revisit KD from a higher vantage point.
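
For orientation, the two families contrasted in the abstract are commonly written as below. This is a sketch in notation from the general KD literature, not this paper's own equations. The symbols are assumptions introduced here: $z^s$ and $z^t$ are student and teacher logits, $T$ is a softmax temperature, $\sigma$ is the softmax, $f^s_i$ and $f^t_i$ are student and teacher features at level $i$, and $P_i$ is a learned projection that matches their shapes.

$$
L_{\text{logit}} = T^2 \, \mathrm{KL}\big(\sigma(z^t / T) \,\|\, \sigma(z^s / T)\big),
\qquad
L_{\text{feat}} = \sum_i \big\| f^t_i - P_i(f^s_i) \big\|_2^2
$$

The first objective compares only final outputs, while the second compares intermediate representations directly. The block-wise framework described above keeps the logit-only form of the first while targeting the intermediate levels addressed by the second.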

Paper Structure

This paper contains 20 sections, 36 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Method Illustration. (a) Logit distillation transfers the entire dark knowledge based solely on output logits. (b) Feature distillation further implements feature alignment at multiple levels. (c) The proposed block-wise logit distillation framework accomplishes implicit feature alignment via output logits of stepping-stones. We refine the distillation into logit-only objectives, with the transfer of dark knowledge decoupled at the block level.
  • Figure 2: Implementations of Feature Alignment. (a-b) Methods typically employed in feature distillation. (c) The proposed consolidation of alignments with identical projections. (d) Implicit feature alignment utilizing logit distillation with stepping stones. The further extension of implicit logit-based feature alignment is in Fig. 3.
  • Figure 3: Framework Overview of Block-KD. The stepping stones are executed implicitly within the fundamental dataflow to generate step-by-step logits. The final objectives are defined merely in terms of output logits.
  • Figure 4: Visualization Results with the R32x4 & R8x4 Pair on CIFAR-100. (a) Cost comparison. (b) Validation results during training. The green line represents results obtained by adding $L^N_{task}$ to the baseline; the blue line further adds $L^N_{distill}$; and the red line incorporates $L_{cross}$ at the 200th epoch on top of the blue-line configuration.
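
The stepping-stone idea in the captions above can be sketched in a few lines of dependency-free Python. This is a toy illustration under stated assumptions, not the authors' implementation: it assumes stepping stone i runs the first i student blocks and then the remaining teacher blocks, that student and teacher features share a shape (so no connector layers are needed), and that every model uses the same output head. The names `kd_loss`, `run_blocks`, and `stepping_stone_logits` are hypothetical, and the loss is the standard temperature-scaled KL divergence rather than the paper's exact objectives.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Temperature-scaled KL(teacher || student), multiplied by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


def run_blocks(blocks, x):
    """Apply a sequence of blocks (plain callables) in order."""
    for block in blocks:
        x = block(x)
    return x


def stepping_stone_logits(student_blocks, teacher_blocks, head, x):
    """Return logits of stepping stones i = 1..N.

    Assumption: stone i = first i student blocks, then teacher blocks
    i+1..N, then a shared head. Stone N is the student path itself.
    """
    n = len(student_blocks)
    outputs = []
    for i in range(1, n + 1):
        hidden = run_blocks(student_blocks[:i], x)
        hidden = run_blocks(teacher_blocks[i:], hidden)
        outputs.append(head(hidden))
    return outputs


# Toy two-block "networks" acting on 3-dimensional inputs (hypothetical).
teacher_blocks = [
    lambda v: [2.0 * a for a in v],
    lambda v: [math.tanh(a) for a in v],
]
student_blocks = [
    lambda v: [1.5 * a for a in v],
    lambda v: [0.5 * a for a in v],
]
head = lambda v: v  # identity head, for illustration only

x = [0.2, -1.0, 0.7]
teacher_logits = head(run_blocks(teacher_blocks, x))
stones = stepping_stone_logits(student_blocks, teacher_blocks, head, x)

# Logit-only objective: one distillation term per stepping stone.
losses = [kd_loss(s, teacher_logits) for s in stones]
total_loss = sum(losses)
```

Every term in `losses` is defined purely on output logits, yet the term for stone i depends on the student's features after block i, which is how feature-level alignment can arise implicitly. A real setup would use trainable network blocks, connectors where shapes differ, and a task loss alongside these terms.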