Scale Decoupled Distillation
Shicai Wei, Chunbo Luo, Yang Luo
TL;DR
The paper identifies a core limitation of logit-based knowledge distillation: the global logit output entangles multiple semantic signals, which can transfer ambiguous knowledge to the student. It proposes Scale Decoupled Distillation (SDD), which uses multi-scale pooling to extract local logits and then splits them into consistent and complementary components. The distillation loss is $L_{SDD}=D_{con}+\beta D_{com}$, where $D_{con}$ and $D_{com}$ are the distillation terms for the consistent and complementary components and $\beta$ weights the complementary part; the overall objective is $L_{1}=L_{CE}+\alpha L_{SDD}$. Empirical results on CIFAR-100, CUB200, and ImageNet show that SDD consistently improves performance across various teacher–student pairs, with the most pronounced gains on fine-grained tasks, while maintaining computational efficiency. The method is compatible with existing logit-based losses and can be combined with feature distillation, making it a practical addition to distillation pipelines. Overall, SDD provides a scalable route to richer, less ambiguous knowledge transfer in logit-based distillation.
Abstract
Logit knowledge distillation has attracted increasing attention in recent studies due to its practicality. However, it often suffers from inferior performance compared to feature knowledge distillation. In this paper, we argue that existing logit-based methods may be sub-optimal since they only leverage the global logit output, which couples multiple kinds of semantic knowledge. This may transfer ambiguous knowledge to the student and mislead its learning. To this end, we propose a simple but effective method, i.e., Scale Decoupled Distillation (SDD), for logit knowledge distillation. SDD decouples the global logit output into multiple local logit outputs and establishes distillation pipelines for them. This helps the student to mine and inherit fine-grained and unambiguous logit knowledge. Moreover, the decoupled knowledge can be further divided into consistent and complementary logit knowledge, which transfer the semantic information and the sample ambiguity, respectively. By increasing the weight of the complementary part, SDD can guide the student to focus more on ambiguous samples, improving its discrimination ability. Extensive experiments on several benchmark datasets demonstrate the effectiveness of SDD for a wide range of teacher-student pairs, especially in fine-grained classification tasks. Code is available at: https://github.com/shicaiwei123/SDD-CVPR2024
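
To make the decoupling step concrete, the following is a minimal pure-Python sketch of the objective $L_{SDD}=D_{con}+\beta D_{com}$ given in the TL;DR; it is not the authors' implementation (see the linked repository for that). Several details are assumptions made for illustration: the helper names `local_logits` and `sdd_loss`, the use of average pooling over an s x s grid followed by a shared linear classifier, the rule that a region is "consistent" when the teacher's local prediction matches its global prediction, and the temperature-scaled KL divergence taken from standard knowledge distillation.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with a distillation temperature.
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def local_logits(feature_map, weights, scales=(1, 2)):
    """Average-pool a C x H x W feature map over an s x s grid for each scale
    and classify every cell with the same linear layer (weights: K x C).
    Assumes scales[0] == 1, so the first output is the global logit, and
    that H and W are divisible by every scale."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    outputs = []
    for s in scales:
        cell_h, cell_w = height // s, width // s
        for gy in range(s):
            for gx in range(s):
                pooled = []
                for channel in feature_map:
                    cell = [channel[y][x]
                            for y in range(gy * cell_h, (gy + 1) * cell_h)
                            for x in range(gx * cell_w, (gx + 1) * cell_w)]
                    pooled.append(sum(cell) / len(cell))
                outputs.append([sum(w * v for w, v in zip(row, pooled))
                                for row in weights])
    return outputs

def sdd_loss(teacher_logits, student_logits, beta=2.0, temperature=4.0):
    """Return (L_SDD, D_con, D_com). Assumption: a region is consistent when
    the teacher's local argmax equals its global argmax (teacher_logits[0]),
    and complementary otherwise."""
    global_class = argmax(teacher_logits[0])
    d_con = d_com = 0.0
    for t, s in zip(teacher_logits, student_logits):
        kl = kl_divergence(softmax(t, temperature), softmax(s, temperature))
        kl *= temperature ** 2  # standard KD gradient rescaling (assumed here)
        if argmax(t) == global_class:
            d_con += kl
        else:
            d_com += kl
    return d_con + beta * d_com, d_con, d_com

# Toy example: 2 channels, 2 x 2 spatial map, 2 classes, identity classifier.
teacher_map = [[[1.0, 1.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
student_map = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
weights = [[1.0, 0.0], [0.0, 1.0]]
teacher = local_logits(teacher_map, weights)
student = local_logits(student_map, weights)
print(sdd_loss(teacher, student, beta=2.0))
```

In this toy input the teacher's global prediction is class 0, but its bottom-right region predicts class 1, so that region contributes to $D_{com}$ while the other regions contribute to $D_{con}$. Raising `beta` increases the weight of the disagreeing region, which is the mechanism the abstract describes for focusing the student on ambiguous samples.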
