DeepKD: A Deeply Decoupled and Denoised Knowledge Distillation Trainer
Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
TL;DR
DeepKD tackles gradient conflicts in knowledge distillation by decomposing the student gradient into three components: the task-oriented gradient (TOG), the target-class gradient (TCG), and the non-target-class gradient (NCG). Each component gets its own momentum updater, with a coefficient chosen according to its gradient signal-to-noise ratio (GSNR). On top of this dual-level decoupling of gradient flows, a curriculum-inspired dynamic top-k mask denoises dark knowledge. Evaluations on CIFAR-100, ImageNet-1K, and MS-COCO show consistent, state-of-the-art gains when DeepKD is integrated with existing logit-based KD methods, with the dynamic masking yielding additional improvements. Together, these pieces give a principled optimization framework that improves transfer efficiency, generalization, and compatibility with diverse KD settings.
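As a rough illustration of what "independent momentum updaters" could look like, the sketch below keeps one momentum buffer per gradient component (task-oriented, target-class, and non-target-class) and sums the buffers into a single parameter update. The function name, the SGD-style update rule, and the coefficient values are assumptions made for illustration only; they are not taken from the DeepKD code, and the summary above only states that each coefficient should relate positively to the component's GSNR.

```python
def step_decoupled_momentum(buffers, grads, coeffs):
    """Update one momentum buffer per gradient component; return the summed update.

    buffers, grads: dict mapping component name -> list of floats (same length).
    coeffs: dict mapping component name -> momentum coefficient in [0, 1).
    """
    size = len(next(iter(grads.values())))
    update = [0.0] * size
    for name, grad in grads.items():
        mu = coeffs[name]
        # Classic heavy-ball accumulation, kept separate for each component.
        buffers[name] = [mu * v + g for v, g in zip(buffers[name], grad)]
        update = [u + v for u, v in zip(update, buffers[name])]
    return update


# Hypothetical coefficients: placeholders, not values reported by the authors.
coeffs = {"TOG": 0.9, "TCG": 0.5, "NCG": 0.9}
buffers = {name: [0.0, 0.0] for name in coeffs}
grads = {"TOG": [1.0, 0.0], "TCG": [0.0, 2.0], "NCG": [1.0, 1.0]}

first = step_decoupled_momentum(buffers, grads, coeffs)
second = step_decoupled_momentum(buffers, grads, coeffs)
```

Because each buffer decays at its own rate, a noisy component with a small coefficient forgets its history quickly, while a component with a large coefficient accumulates a longer running direction.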
Abstract
Recent advances in knowledge distillation have emphasized the importance of decoupling different knowledge components. While existing methods utilize momentum mechanisms to separate task-oriented and distillation gradients, they overlook the inherent conflict between target-class and non-target-class knowledge flows. Furthermore, low-confidence dark knowledge in non-target classes introduces noisy signals that hinder effective knowledge transfer. To address these limitations, we propose DeepKD, a novel training framework that integrates dual-level decoupling with adaptive denoising. First, through theoretical analysis of gradient signal-to-noise ratio (GSNR) characteristics in task-oriented and non-task-oriented knowledge distillation, we design independent momentum updaters for each component to prevent mutual interference. We observe that the optimal momentum coefficients for task-oriented gradient (TOG), target-class gradient (TCG), and non-target-class gradient (NCG) should be positively related to their GSNR. Second, we introduce a dynamic top-k mask (DTM) mechanism that gradually increases K from a small initial value to incorporate more non-target classes as training progresses, following curriculum learning principles. The DTM jointly filters low-confidence logits from both teacher and student models, effectively purifying dark knowledge during early training. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO demonstrate DeepKD's effectiveness. Our code is available at https://github.com/haiduo/DeepKD.
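The dynamic top-k mask described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the linear schedule for K, ranking non-target classes by teacher logits, and the helper names `k_schedule`, `top_k_mask`, and `masked_softmax` are all hypothetical. The one property taken from the abstract is that the same mask is applied to both teacher and student logits.

```python
import math


def k_schedule(epoch, total_epochs, k_start, k_end):
    """Grow K linearly from k_start to k_end over training (assumed schedule)."""
    if total_epochs <= 1:
        return k_end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return k_start + round(frac * (k_end - k_start))


def top_k_mask(teacher_logits, target, k):
    """Return a 0/1 mask keeping the k highest-scoring non-target classes."""
    non_target = [i for i in range(len(teacher_logits)) if i != target]
    ranked = sorted(non_target, key=lambda i: teacher_logits[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(teacher_logits))]


def masked_softmax(logits, mask):
    """Softmax restricted to positions where mask == 1; zero elsewhere."""
    kept = [z for z, m in zip(logits, mask) if m]
    if not kept:
        return [0.0] * len(logits)
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) if m else 0.0 for z, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


teacher = [2.0, 5.0, 1.0, 3.0, 0.5]
student = [1.0, 4.0, 0.0, 1.0, 2.0]
target = 1

k = k_schedule(epoch=0, total_epochs=10, k_start=2, k_end=4)
mask = top_k_mask(teacher, target, k)
# The same mask filters both distributions before they are compared.
teacher_probs = masked_softmax(teacher, mask)
student_probs = masked_softmax(student, mask)
```

Early in training only the few most confident non-target classes contribute to the distillation signal; as K grows, lower-confidence classes are gradually admitted.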
