Scale Decoupled Distillation

Shicai Wei Chunbo Luo Yang Luo

TL;DR

The paper identifies a core limitation of logit-based knowledge distillation: global logits entangle diverse semantic signals, which can transfer ambiguous knowledge to the student. It proposes Scale Decoupled Distillation (SDD), which uses multi-scale pooling to extract local logits and then splits these into consistent and complementary components, formalized by $L_{SDD}=D_{con}+\beta D_{com}$ and the overall objective $L_{1}=L_{CE}+\alpha L_{SDD}$. Empirical results across CIFAR-100, CUB200, and ImageNet show that SDD consistently improves performance for various teacher–student pairs, with pronounced gains in fine-grained tasks, while maintaining computational efficiency. The method is compatible with existing logit-based losses and can be integrated with feature distillation, offering a practical enhancement for distillation pipelines. Overall, SDD provides a scalable route to richer, less ambiguous knowledge transfer in logit-based distillation.
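The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the convention that the first entry of each list holds the global logits, the temperature-scaled KL divergence used as the per-region distance, and the default values of `alpha`, `beta`, and `temperature` are all assumptions made for the example. A region counts as consistent when the teacher's local prediction matches its global prediction, and as complementary otherwise.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over a list of floats.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def sdd_loss(teacher_regions, student_regions, beta=2.0, temperature=4.0):
    # Assumption: index 0 holds the global logits, the rest are local logits.
    global_class = argmax(teacher_regions[0])
    d_con = 0.0  # regions whose teacher prediction agrees with the global one
    d_com = 0.0  # regions whose teacher prediction disagrees
    for t_logits, s_logits in zip(teacher_regions, student_regions):
        # Assumption: standard temperature-scaled KD distance per region.
        distance = temperature ** 2 * kl_divergence(
            softmax(t_logits, temperature), softmax(s_logits, temperature)
        )
        if argmax(t_logits) == global_class:
            d_con += distance
        else:
            d_com += distance
    # L_SDD = D_con + beta * D_com
    return d_con + beta * d_com

def total_loss(cross_entropy, sdd, alpha=1.0):
    # L_1 = L_CE + alpha * L_SDD
    return cross_entropy + alpha * sdd
```

Raising `beta` above 1 increases the contribution of regions where the teacher's local and global predictions disagree, which is how the summary describes steering the student toward ambiguous samples.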

Abstract

Logit knowledge distillation has attracted increasing attention in recent studies due to its practicality. However, it often performs worse than feature knowledge distillation. In this paper, we argue that existing logit-based methods may be sub-optimal because they only leverage the global logit output, which couples multiple kinds of semantic knowledge. This may transfer ambiguous knowledge to the student and mislead its learning. To this end, we propose a simple but effective method, Scale Decoupled Distillation (SDD), for logit knowledge distillation. SDD decouples the global logit output into multiple local logit outputs and establishes distillation pipelines for them. This helps the student mine and inherit fine-grained and unambiguous logit knowledge. Moreover, the decoupled knowledge can be further divided into consistent and complementary logit knowledge, which transfer the semantic information and the sample ambiguity, respectively. By increasing the weight of the complementary parts, SDD can guide the student to focus more on ambiguous samples, improving its discrimination ability. Extensive experiments on several benchmark datasets demonstrate the effectiveness of SDD for a wide range of teacher-student pairs, especially in fine-grained classification tasks. Code is available at: https://github.com/shicaiwei123/SDD-CVPR2024
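The decoupling step described in the abstract can be illustrated with a small pure-Python sketch. It assumes a feature map stored as nested lists (channels x height x width), a shared linear classifier given by hypothetical `weights` and `bias` arguments, and a grid of `s x s` average-pooled cells per scale; none of these names come from the paper or its repository. Scale 1 reproduces the conventional global logit from global average pooling, and larger scales yield the local logits.

```python
def region_logits(feature_map, weights, bias, scales=(1, 2)):
    # feature_map: [channels][height][width]; weights: [classes][channels].
    # Each scale s must not exceed the height or width of the feature map.
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    outputs = []
    for s in scales:
        for i in range(s):
            for j in range(s):
                # Boundaries of cell (i, j) in an s x s grid.
                r0, r1 = i * height // s, (i + 1) * height // s
                c0, c1 = j * width // s, (j + 1) * width // s
                area = (r1 - r0) * (c1 - c0)
                # Average-pool every channel over the cell.
                pooled = [
                    sum(feature_map[ch][r][c]
                        for r in range(r0, r1)
                        for c in range(c0, c1)) / area
                    for ch in range(channels)
                ]
                # Apply the shared linear classifier to the pooled vector.
                logits = [
                    sum(w * p for w, p in zip(row, pooled)) + b
                    for row, b in zip(weights, bias)
                ]
                outputs.append(logits)
    return outputs
```

With `scales=(1, 2)` the function returns five logit vectors per image: one global vector followed by four local ones, each of which can be paired with the corresponding teacher output for distillation.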

Paper Structure (15 sections, 6 equations, 6 figures, 9 tables)

Introduction
Related Work
Feature-based Distillation
Logit-based Distillation
Method
Conventional Knowledge Distillation
Scale Decoupled Knowledge Distillation
Experiments
Experimental Setups
Comparison Results
Ablation Study
Conclusion
Appendix
Ablation Study
Discussion

Figures (6)

Figure 1: Image visualization on ImageNet. (a) The top row shows some samples of class 6 that are misclassified by ResNet34. The bottom row displays the corresponding predicted classes and a sample from each. (b) An intuitive illustration of scale decoupling.
Figure 2: Illustration of the conventional KD (a) and our SDD (b). Compared with the conventional KD that only considers the global logit knowledge via global average pooling, SDD proposes to capture the multi-scale logit knowledge via the multi-scale pooling so that the student can inherit the fine-grained and unambiguous semantic knowledge from the teacher.
Figure 3: Difference of correlation matrices of student and teacher logits of KD (left) and SD-KD (right).
Figure 4: t-SNE of features learned by KD (left) and SD-KD (right).
Figure 5: Some examples that can be classified correctly by the student trained with SD-KD while misclassified by the student trained with conventional KD.
...and 1 more figure
