DeepKD: A Deeply Decoupled and Denoised Knowledge Distillation Trainer

Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren

TL;DR

DeepKD tackles gradient conflicts in knowledge distillation by decomposing the student gradient into three components, the task-oriented gradient (TOG), the target-class gradient (TCG), and the non-target-class gradient (NCG), and by assigning each its own momentum coefficient guided by its gradient signal-to-noise ratio (GSNR). It pairs this dual-level decoupling of gradient flows with a curriculum-inspired dynamic top-k mask that denoises dark knowledge. Evaluations on CIFAR-100, ImageNet-1K, and MS-COCO show consistent, state-of-the-art gains when DeepKD is integrated with existing logit-based KD methods, with the dynamic masking yielding additional improvements. Collectively, DeepKD provides a principled optimization framework that improves transfer efficiency, generalization, and compatibility with diverse KD settings.
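Since the method leans on the gradient signal-to-noise ratio, a small illustration of that quantity may help. The sketch below uses the commonly cited per-parameter definition, the squared mean of the per-sample gradients divided by their variance. The function name `gsnr` and the stabilizer `eps` are illustrative assumptions and are not taken from the DeepKD code.

```python
from statistics import fmean, pvariance


def gsnr(per_sample_grads, eps=1e-12):
    """Gradient signal-to-noise ratio of one parameter.

    per_sample_grads: gradients of the loss w.r.t. a single parameter,
    one value per training sample.
    eps: small constant (an assumption) that avoids division by zero
    when every sample yields the same gradient.
    """
    mean = fmean(per_sample_grads)
    # Population variance of the per-sample gradients around their mean.
    var = pvariance(per_sample_grads, mu=mean)
    return mean * mean / (var + eps)
```

Gradients that agree across samples (high mean, low variance) give a large GSNR, while gradients that cancel out across samples give a GSNR near zero.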

Abstract

Recent advances in knowledge distillation have emphasized the importance of decoupling different knowledge components. While existing methods utilize momentum mechanisms to separate task-oriented and distillation gradients, they overlook the inherent conflict between target-class and non-target-class knowledge flows. Furthermore, low-confidence dark knowledge in non-target classes introduces noisy signals that hinder effective knowledge transfer. To address these limitations, we propose DeepKD, a novel training framework that integrates dual-level decoupling with adaptive denoising. First, through theoretical analysis of gradient signal-to-noise ratio (GSNR) characteristics in task-oriented and non-task-oriented knowledge distillation, we design independent momentum updaters for each component to prevent mutual interference. We observe that the optimal momentum coefficients for task-oriented gradient (TOG), target-class gradient (TCG), and non-target-class gradient (NCG) should be positively related to their GSNR. Second, we introduce a dynamic top-k mask (DTM) mechanism that gradually increases K from a small initial value to incorporate more non-target classes as training progresses, following curriculum learning principles. The DTM jointly filters low-confidence logits from both teacher and student models, effectively purifying dark knowledge during early training. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO demonstrate DeepKD's effectiveness. Our code is available at https://github.com/haiduo/DeepKD.
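The idea of independent momentum updaters can be sketched as follows for a single scalar parameter. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DecoupledMomentum`, the learning rate, and the three momentum values are hypothetical placeholders. In practice the coefficients would be ordered according to the measured GSNR of each component.

```python
from dataclasses import dataclass, field


@dataclass
class DecoupledMomentum:
    """SGD-style update with one momentum buffer per gradient component."""

    lr: float = 0.1
    # Placeholder coefficients (assumption): one per component, so that
    # each can be tuned to that component's GSNR.
    momentum: dict = field(
        default_factory=lambda: {"tog": 0.95, "tcg": 0.85, "ncg": 0.95}
    )
    # Separate buffers keep the components from interfering with each other.
    buffers: dict = field(
        default_factory=lambda: {"tog": 0.0, "tcg": 0.0, "ncg": 0.0}
    )

    def step(self, param, grads):
        """Apply one update.

        grads maps "tog", "tcg", "ncg" to that component's gradient
        for this parameter.
        """
        for name, g in grads.items():
            self.buffers[name] = self.momentum[name] * self.buffers[name] + g
        # The parameter moves along the sum of the three buffered directions.
        return param - self.lr * sum(self.buffers.values())
```

A vanilla momentum optimizer would instead sum the three gradients first and accumulate them in one shared buffer with one coefficient. Keeping the buffers apart is what allows a different coefficient per component.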

Paper Structure

This paper contains 19 sections, 37 equations, 8 figures, 9 tables, and 2 algorithms.

Figures (8)

  • Figure 1: Analysis of optimization dynamics and knowledge transfer for the ResNet32$\times$4/ResNet8$\times$4 teacher/student pair on CIFAR-100: (a) gradient signal-to-noise ratio (GSNR) comparison across different knowledge distillation methods, (b) loss landscape visualization (Li et al., 2018) showing the flatness of minima, and (c) the dynamic top-k masking process for dark knowledge denoising, which aligns with curriculum learning.
  • Figure 2: Comparison of gradient and buffer SNR between vanilla KD and DeepKD: (a) KD GSNR with less component separation, (b) DeepKD GSNR with better component distinction, (c) KD BSNR with limited separation, and (d) DeepKD BSNR with enhanced component differentiation.
  • Figure 3: Analysis of top-k masking strategy. (a) Distribution of teacher model's confidence on target classes. (b) Accuracy comparison of different static top-k values for knowledge distillation. (c) Learning curve divided into distinct training phases with the optimal top-k masking approach.
  • Figure 4: Detailed architecture of our DeepKD framework. Input images flow through teacher and student networks, producing target (yellow) and non-target (green) logits. The framework uses three independent gradient paths (task-oriented, target-class, and non-target-class) with separate momentum buffers. Dynamic Top-k Mask filters low-confidence non-target logits (gray cells).
  • Figure 5: Difference of student and teacher logits. DeepKD leads to a significantly smaller difference (more similar prediction) than other KD methods.
  • ...and 3 more figures
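
The dynamic top-k mask that appears in Figures 1(c), 3, and 4 can be sketched roughly as below. This is a hedged illustration, not the released code: the linear schedule in `scheduled_k`, the choice to rank non-target classes by teacher logit, and all function names are assumptions made for the example. The paper's own schedule is phase-based (see Figure 3(c)) and may differ.

```python
def scheduled_k(epoch, total_epochs, k_start, k_end):
    """Grow k from k_start to k_end over training (assumed linear schedule)."""
    if total_epochs <= 1:
        return k_end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return round(k_start + frac * (k_end - k_start))


def top_k_mask(teacher_logits, target, k):
    """Boolean mask keeping the k highest teacher-ranked non-target classes.

    The target class is always excluded, because the mask only concerns
    the non-target (dark knowledge) part of the distribution.
    """
    non_target = [i for i in range(len(teacher_logits)) if i != target]
    ranked = sorted(non_target, key=lambda i: teacher_logits[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(teacher_logits))]


def apply_mask(logits, mask):
    """Select the logits that survive the mask."""
    return [z for z, m in zip(logits, mask) if m]
```

Applying the same mask to both the teacher and the student logits gives the joint filtering described in the abstract: early in training only a few confident non-target classes contribute to the non-target distillation term, and more are admitted as k grows.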