Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective

Bowen Zheng, Ran Cheng

TL;DR

This work rethinks decoupled knowledge distillation (DKD) through a predictive distribution lens and introduces Generalized DKD (GDKD), a flexible, two-level logit partitioning framework. The GDKD loss decouples logits into top-level and leaf-level components with tunable weights, enabling efficient handling of multimodal teacher predictions and enhanced learning from non-top logits. Empirical analysis reveals that partitioning by the top logit strengthens non-top logit relationships and that increasing emphasis on non-top distillation boosts knowledge extraction, leading to superior performance across CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes compared to DKD and many feature-based methods. The proposed vanilla GDKD algorithm achieves a favorable balance between accuracy and training speed without extra parameters, with extensions like GDKD3 for transformers and combinations with Logit Standardization further boosting results.

Abstract

In the history of knowledge distillation, the focus once shifted from logit-based to feature-based approaches. However, this transition has been revisited with the advent of Decoupled Knowledge Distillation (DKD), which re-emphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. While DKD marks a significant advancement, its underlying mechanisms merit deeper exploration. In response, we rethink DKD from a predictive distribution perspective. First, we introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss, which offers a more versatile method for decoupling logits. Then we pay particular attention to the teacher model's predictive distribution and its impact on the gradients of the GDKD loss, uncovering two critical insights that are often overlooked: (1) partitioning by the top logit considerably improves the interrelationship of non-top logits, and (2) amplifying the focus on the distillation loss of non-top logits enhances the knowledge extraction among them. Building on these insights, we further propose a streamlined GDKD algorithm with an efficient partition strategy to handle the multimodality of teacher models' predictive distributions. Our comprehensive experiments on a variety of benchmarks, including CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, demonstrate GDKD's superior performance over both the original DKD and other leading knowledge distillation methods. The code is available at https://github.com/ZaberKo/GDKD.
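
As a reading aid for the two insights above, the standard chain rule of the KL divergence can be written for a two-group partition of the classes. This is a sketch and not necessarily the paper's exact formulation: the scalar $b_\mathbb{T}^\mathcal{T} = \sum_{i \in \mathbb{T}} p_i^\mathcal{T}$ (the teacher's probability mass on the class set $\mathbb{T}$) is notation assumed here, $\bm{b}$ denotes the binary distribution over the two groups, and $\bm{p}_\mathbb{T}$, $\bm{p}_{\setminus \mathbb{T}}$ denote the distributions renormalized within each group.

$$
\mathrm{KL}(\bm{p}^\mathcal{T} \,\|\, \bm{p}^\mathcal{S}) = \mathrm{KL}(\bm{b}^\mathcal{T} \,\|\, \bm{b}^\mathcal{S}) + b_\mathbb{T}^\mathcal{T}\, \mathrm{KL}(\bm{p}_\mathbb{T}^\mathcal{T} \,\|\, \bm{p}_\mathbb{T}^\mathcal{S}) + (1 - b_\mathbb{T}^\mathcal{T})\, \mathrm{KL}(\bm{p}_{\setminus \mathbb{T}}^\mathcal{T} \,\|\, \bm{p}_{\setminus \mathbb{T}}^\mathcal{S})
$$

When $\mathbb{T}$ contains only the teacher's top class, the middle term vanishes (a one-element distribution has zero divergence) and the non-top term is scaled by $1 - b_\mathbb{T}^\mathcal{T}$, which is small whenever the teacher is confident. Replacing these coupled factors with tunable weights is what allows the emphasis on non-top logits to be increased.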

Paper Structure

This paper contains 32 sections, 19 equations, 10 figures, 12 tables, 1 algorithm.

Figures (10)

  • Figure 1: Left: Logit-based knowledge distillation framework. The student model is supervised by the large pre-trained teacher's soft labels and target labels to enhance its performance. Right: Proposed GDKD loss under two-level decomposition with two partitions. Logits from the teacher $\mathcal{T}$ and student $\mathcal{S}$ are separately partitioned into two groups via class sets $\{\mathbb{T}, \setminus \mathbb{T}\}$, where $\mathbb{T}$ is a subset of class indices $\{1,\dots,C \}$. Then the distillation loss is calculated based on decoupled loss terms at the top-level binary distribution pair $(\bm{b}^\mathcal{T},\bm{b}^\mathcal{S})$ and the low-level distribution pairs $(\bm{p}_\mathbb{T}^\mathcal{T},\bm{p}_\mathbb{T}^\mathcal{S})$, $(\bm{p}_{\setminus \mathbb{T}}^\mathcal{T},\bm{p}_{\setminus \mathbb{T}}^\mathcal{S})$.
  • Figure 2: Illustration of logit partition strategies in DKD and GDKD. DKD utilizes a singular criterion, specifically the target label $t$, for logit partitioning. In contrast, GDKD employs a more flexible and sophisticated strategy, utilizing any mutually exclusive sets $\mathbb{T}$ and $\setminus \mathbb{T}$ for partitioning.
  • Figure 3: Gradient magnitude distribution of non-top logits (other) in GDKD-top1 on CIFAR-100. The figure displays the distribution of non-top logits' average gradient magnitudes across training epochs under GDKD-top1 (where $\mathbb{T} = \{\arg\max_i (z_i^\mathcal{T})\}$). It includes distributions for $\mathcal{L}_\text{TopKD}$ (other-TopKD), $\beta\mathcal{L}_\text{OtherKD}$ (other-OtherKD), and the coupled term $(1-p_c^\mathcal{T}) \mathcal{L}_\text{OtherKD}$ (other-KD) from the traditional KD loss for comparison.
  • Figure 4: Average class predictions by teacher model ResNet32x4 on CIFAR-100. The figure compares the average predictions for two randomly selected classes, 'boy' and 'shark', using (a) standard data augmentation and (b) AutoAugment (Cubuk et al., 2019) on CIFAR-100's training set with a temperature of $T=4$. Top 5 predictions for each class are highlighted in different colors.
  • Figure 5: The average predictions of teacher ResNet32x4 over one class of CIFAR-100's training dataset with temperature $T=4$, where the top1 predictions are ignored. (a): The predictions $\bm{p}^\mathcal{T}$ over one class without plotting the top1 value; (b): The reconstructed predictions of remaining logits $\bm{p}_{\setminus \mathbb{T}}^\mathcal{T}$ through softmax. The teacher is trained with standard data augmentations. The plot randomly selects 2 classes and marks the top 4 predictions with different colors.
  • ...and 5 more figures
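
The two-level partition described in the Figure 1 and Figure 5 captions can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation (which lives in the linked repository): the function names, the toy logits, and the weights `w_top`, `w_in`, `w_out` are all hypothetical. It builds the top-level binary pair and the renormalized leaf distributions for a top-1 partition, then checks numerically that the coupled weighting recovers the ordinary KD divergence.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def partition(probs, top_set):
    """Split probs into the binary pair b and two renormalized leaf
    distributions. top_set must be a non-empty proper subset of indices."""
    idx_in = sorted(top_set)
    idx_out = [i for i in range(len(probs)) if i not in top_set]
    mass_in = sum(probs[i] for i in idx_in)
    mass_out = sum(probs[i] for i in idx_out)
    b = [mass_in, mass_out]
    p_in = [probs[i] / mass_in for i in idx_in]
    p_out = [probs[i] / mass_out for i in idx_out]  # softmax over the rest
    return b, p_in, p_out

def gdkd_terms(teacher_logits, student_logits, top_set, temperature=4.0):
    """Return the three decoupled KL terms plus the plain KD divergence."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    b_t, in_t, out_t = partition(p_t, top_set)
    b_s, in_s, out_s = partition(p_s, top_set)
    return {
        "top": kl(b_t, b_s),      # top-level binary distributions
        "in": kl(in_t, in_s),     # leaf distributions inside top_set
        "out": kl(out_t, out_s),  # leaf distributions outside top_set
        "b_teacher": b_t,
        "kd": kl(p_t, p_s),       # ordinary (coupled) KD divergence
    }

# Toy logits (hypothetical values, 4 classes).
teacher = [5.0, 2.0, 1.0, 0.5]
student = [3.0, 2.5, 0.0, 1.0]

# Top-1 partition: the set holds only the teacher's arg-max class.
top_set = {max(range(len(teacher)), key=lambda i: teacher[i])}
terms = gdkd_terms(teacher, student, top_set)

# Coupled weighting (teacher group masses) reproduces plain KD.
b_in, b_out = terms["b_teacher"]
coupled = terms["top"] + b_in * terms["in"] + b_out * terms["out"]

# Decoupled weighting: free weights, chosen here only for illustration.
w_top, w_in, w_out = 1.0, 1.0, 8.0
decoupled = w_top * terms["top"] + w_in * terms["in"] + w_out * terms["out"]
```

In this sketch the leaf distribution over the remaining classes is obtained by renormalizing the full softmax, which is equivalent to applying softmax to the remaining logits alone, as in Figure 5(b). The common $T^2$ scaling of distillation losses is omitted for brevity.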

Theorems & Definitions (2)

  • proof
  • proof