Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective
Bowen Zheng, Ran Cheng
TL;DR
This work rethinks decoupled knowledge distillation (DKD) through a predictive distribution lens and introduces Generalized DKD (GDKD), a flexible, two-level logit partitioning framework. The GDKD loss decouples logits into top-level and leaf-level components with tunable weights, enabling efficient handling of multimodal teacher predictions and enhanced learning from non-top logits. Empirical analysis reveals that partitioning by the top logit strengthens non-top logit relationships and that increasing emphasis on non-top distillation boosts knowledge extraction, leading to superior performance across CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes compared to DKD and many feature-based methods. The proposed vanilla GDKD algorithm achieves a favorable balance between accuracy and training speed without extra parameters, with extensions like GDKD3 for transformers and combinations with Logit Standardization further boosting results.
Abstract
In the history of knowledge distillation, the focus shifted over time from logit-based to feature-based approaches. However, this transition has been revisited with the advent of Decoupled Knowledge Distillation (DKD), which re-emphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. While DKD marks a significant advance, its underlying mechanisms merit deeper exploration. In response, we rethink DKD from a predictive distribution perspective. First, we introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss, which offers a more versatile method for decoupling logits. We then pay particular attention to the teacher model's predictive distribution and its impact on the gradients of the GDKD loss, uncovering two critical insights that are often overlooked: (1) partitioning by the top logit considerably improves the interrelationship among non-top logits, and (2) amplifying the focus on the distillation loss of non-top logits enhances the knowledge extraction among them. Building on these insights, we further propose a streamlined GDKD algorithm with an efficient partition strategy to handle the multimodality of the teacher model's predictive distribution. Comprehensive experiments on a variety of benchmarks, including CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, demonstrate GDKD's superior performance over both the original DKD and other leading knowledge distillation methods. The code is available at https://github.com/ZaberKo/GDKD.
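
To make the idea of a two-level logit partition concrete, the following is a minimal sketch and not the authors' implementation (see the repository above for that). It relies only on the chain rule of KL divergence: for any partition of the classes, the full KL between teacher and student equals a top-level KL over partition masses plus the teacher-mass-weighted KL within each partition. Replacing those fixed weights with tunable ones gives a decoupled loss in the spirit described here. The names `two_level_loss`, `top1_partition`, `w_top`, and `w_leaf` are hypothetical, and the usual temperature-squared scaling is omitted for brevity.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def top1_partition(teacher_logits):
    """Split class indices into [teacher's top class] and [all other classes]."""
    n = len(teacher_logits)
    top = max(range(n), key=teacher_logits.__getitem__)
    return [[top], [i for i in range(n) if i != top]]


def two_level_loss(teacher_logits, student_logits, groups,
                   w_top=1.0, w_leaf=None, temperature=1.0):
    """Top-level KL over group masses plus weighted within-group (leaf) KLs.

    With w_top = 1 and w_leaf equal to the teacher's group masses (the
    default), this equals the plain KL divergence. Other weights are an
    illustrative assumption about how the terms could be re-weighted.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # Top level: probability mass that each model assigns to each group.
    b_t = [sum(p_t[i] for i in g) for g in groups]
    b_s = [sum(p_s[i] for i in g) for g in groups]
    if w_leaf is None:
        w_leaf = b_t  # exact chain-rule weights
    loss = w_top * kl(b_t, b_s)
    # Leaf level: distributions renormalized inside each group.
    for k, g in enumerate(groups):
        cond_t = [p_t[i] / b_t[k] for i in g]
        cond_s = [p_s[i] / b_s[k] for i in g]
        loss += w_leaf[k] * kl(cond_t, cond_s)
    return loss


teacher = [2.0, 1.0, 0.5, -1.0]
student = [1.5, 1.2, 0.0, -0.5]
groups = top1_partition(teacher)
exact = two_level_loss(teacher, student, groups)
boosted = two_level_loss(teacher, student, groups, w_leaf=[0.0, 8.0])
```

In this sketch, `exact` matches the ordinary KL divergence between the two softmax outputs, while `boosted` places a larger, hand-picked weight on the non-top group. When the teacher is confident, the chain-rule weight on the non-top group is small, which is one way to see why increasing it can draw more signal from the non-top logits.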
