Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head

Penghui Yang, Chen-Chen Zong, Sheng-Jun Huang, Lei Feng, Bo An

TL;DR

The paper addresses the information lost when knowledge is distilled through predicted probabilities by introducing a logit-level loss, $\mathcal{L}_{\mathrm{BinaryKL}}$. Combined directly with the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$, this loss causes gradient conflicts, which the paper analyzes through neural collapse theory: the two losses conflict in the linear classifier, while the backbone still benefits from the logit-level signal. The proposed Dual-Head Knowledge Distillation (DHKD) decouples the classifier into two heads, with an auxiliary head dedicated to the logit-level loss and a stabilized BinaryKL-Norm to align gradients. DHKD preserves the backbone improvements while avoiding degradation of the classifier. Experiments on CIFAR-100 and ImageNet show that DHKD outperforms state-of-the-art KD methods and complements feature-based KD such as ReviewKD. The approach is modular (decoupled heads, gradient alignment) and supports cross-architecture distillation. It also extends to semantic segmentation on Cityscapes and targets resource-constrained deployment.

Abstract

Traditional knowledge distillation focuses on aligning the student's predicted probabilities with both ground-truth labels and the teacher's predicted probabilities. However, the transition to predicted probabilities from logits would obscure certain indispensable information. To address this issue, it is intuitive to additionally introduce a logit-level loss function as a supplement to the widely used probability-level loss function, for exploiting the latent information of logits. Unfortunately, we empirically find that the amalgamation of the newly introduced logit-level loss and the previous probability-level loss will lead to performance degeneration, even trailing behind the performance of employing either loss in isolation. We attribute this phenomenon to the collapse of the classification head, which is verified by our theoretical analysis based on the neural collapse theory. Specifically, the gradients of the two loss functions exhibit contradictions in the linear classifier yet display no such conflict within the backbone. Drawing from the theoretical analysis, we propose a novel method called dual-head knowledge distillation, which partitions the linear classifier into two classification heads responsible for different losses, thereby preserving the beneficial effects of both losses on the backbone while eliminating adverse influences on the classification head. Extensive experiments validate that our method can effectively exploit the information inside the logits and achieve superior performance against state-of-the-art counterparts. Our code is available at: https://github.com/penghui-yang/DHKD.
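The claim that moving from logits to probabilities obscures information can be checked with a small sketch. It relies only on the standard fact that softmax is invariant to adding a constant to every logit. The element-wise sigmoid is used here as an illustrative logit-level view and is an assumption, not necessarily the exact form of the paper's BinaryKL loss. The example vectors and variable names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Element-wise logistic function (illustrative logit-level view)."""
    return 1.0 / (1.0 + math.exp(-z))

# Two different logit vectors: the second is the first shifted by +3.
logits_a = [2.0, 1.0, 0.0]
logits_b = [5.0, 4.0, 3.0]

# Softmax maps both vectors to the same probabilities, so the shift is lost.
probs_a = softmax(logits_a)
probs_b = softmax(logits_b)
same_softmax = all(math.isclose(p, q) for p, q in zip(probs_a, probs_b))

# A per-class sigmoid keeps the two vectors distinguishable.
sig_a = [sigmoid(z) for z in logits_a]
sig_b = [sigmoid(z) for z in logits_b]
same_sigmoid = all(math.isclose(p, q) for p, q in zip(sig_a, sig_b))

print(same_softmax)  # True
print(same_sigmoid)  # False
```

A teacher that outputs `logits_a` and one that outputs `logits_b` look identical to a probability-level loss. That lost difference is the kind of information a logit-level loss is meant to recover.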

Paper Structure

This paper contains 33 sections, 6 theorems, 27 equations, 6 figures, 13 tables.

Key Result

Proposition 1

The gradient of $\mathcal{L}_{\mathrm{overall}}$ w.r.t. the linear classifier can be formulated as follows: where

Figures (6)

  • Figure 1: The reason for introducing the BinaryKL loss and the incompatibility of the CE loss and the BinaryKL loss. Figure \ref{fig:logit} shows that two different vectors may become the same through the softmax function, which means some information is lost in the softmax transformation. Figure \ref{fig:incomp-illu} shows the four settings we evaluate: (1) only the cross-entropy (CE) loss; (2) only the BinaryKL loss; (3) CE + BinaryKL; (4) our proposed Dual-Head Knowledge Distillation (DHKD). The red cross over the third setting means that the student model trained under this setting collapses. The performance ranking of the four settings is (4) $>$ (1) $>$ (2) $\gg$ (3), as shown in Figure \ref{fig:incomp-curves}.
  • Figure 2: The test accuracy and test loss curves of the student models during the training phase. We set resnet56 as the teacher and resnet20 as the student on the CIFAR-100 dataset.
  • Figure 3: Gradient analysis based on the neural collapse theory. (a) An illustration of a simplex ETF when $d = 3$ and $K = 4$. The "+" and "$\text{☆}$" with different colors refer to features and classifier vectors of different classes, respectively. (b) Gradient directions for a certain $\bm{h}$ (belonging to the 1st class) w.r.t. all $\bm{w}_i, i \in \{1,2,3\}$. (c) Gradient directions w.r.t. an $\bm{h}$ (belonging to the 1st class).
  • Figure 4: An illustration of Dual-Head Knowledge Distillation (DHKD). DHKD decouples the original linear classifier into a duo. We can evade conflicts between two losses by introducing an Adaptive Auxiliary Classifier customized according to the models' architectures. The right side shows the gradient alignment method on CIFAR-100. When the angle between two gradients is larger than $90^{\circ}$, we will project the gradient of the BinaryKL loss to the orthogonal direction of the gradient of the CE loss.
  • Figure 5: t-SNE visualization of features learned by different methods. We do the visualization on the test set of CIFAR-100. We set resnet32$\times$4 as the teacher and resnet8$\times$4 as the student.
  • ...and 1 more figure
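The gradient alignment rule described in the Figure 4 caption can be sketched as follows. This is a minimal illustration on plain Python lists, assuming the projection is the usual removal of the component along the CE gradient. The function name `align_gradient` and the toy vectors are hypothetical and are not taken from the authors' code.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def align_gradient(g_aux, g_ce):
    """Project g_aux onto the orthogonal complement of g_ce when they conflict.

    A negative inner product means the angle between the two gradients
    exceeds 90 degrees. Otherwise g_aux is returned unchanged.
    """
    inner = dot(g_aux, g_ce)
    if inner >= 0.0:
        return list(g_aux)
    # inner < 0 implies g_ce is non-zero, so the division is safe.
    scale = inner / dot(g_ce, g_ce)
    return [a - scale * c for a, c in zip(g_aux, g_ce)]

g_ce = [1.0, 0.0]  # toy CE gradient

# Conflicting case: angle greater than 90 degrees, so the conflict is removed.
projected = align_gradient([-1.0, 1.0], g_ce)

# Non-conflicting case: angle below 90 degrees, so the gradient is left as is.
unchanged = align_gradient([1.0, 1.0], g_ce)

print(projected)  # [0.0, 1.0]
print(unchanged)  # [1.0, 1.0]
```

After projection the auxiliary gradient has zero inner product with the CE gradient, so it no longer pushes against the CE update while keeping its remaining direction.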

Theorems & Definitions (7)

  • Proposition 1
  • Proposition 2
  • Definition 1: Simplex Equiangular Tight Frame (Papyan et al., 2020)
  • Theorem 1 (Papyan et al., 2020)
  • Lemma 1
  • Proposition 3
  • Proposition 4
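For readers unfamiliar with the simplex equiangular tight frame named in Definition 1, the sketch below builds one from the standard construction $\sqrt{K/(K-1)}\,(I_K - \tfrac{1}{K}\mathbf{1}\mathbf{1}^\top)$ and checks its two defining properties. It assumes the optional orthogonal rotation is the identity, so the $K$ vectors live in $K$ dimensions (they span a $(K-1)$-dimensional subspace, matching the $d = 3$, $K = 4$ illustration). The helper names are hypothetical.

```python
import math

def simplex_etf(K):
    """Return K vectors of a simplex ETF, each a list of length K."""
    scale = math.sqrt(K / (K - 1))
    return [
        [scale * ((1.0 if r == c else 0.0) - 1.0 / K) for r in range(K)]
        for c in range(K)
    ]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

K = 4
vectors = simplex_etf(K)

# Gram matrix of pairwise inner products.
gram = [[dot(u, v) for v in vectors] for u in vectors]

# Every vector has unit norm.
unit_norm = all(math.isclose(gram[i][i], 1.0, abs_tol=1e-12) for i in range(K))

# Every distinct pair has the same cosine, -1 / (K - 1).
equiangular = all(
    math.isclose(gram[i][j], -1.0 / (K - 1), abs_tol=1e-12)
    for i in range(K) for j in range(K) if i != j
)

print(unit_norm, equiangular)  # True True
```

The equal pairwise cosine of $-1/(K-1)$ is the largest uniform angular separation possible for $K$ unit vectors, which is the geometry neural collapse theory assigns to class means and classifier vectors.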