Neural Collapse Inspired Knowledge Distillation

Shuoxi Zhang, Zijian Song, Kun He

TL;DR

Knowledge distillation often suffers from a persistent teacher–student gap. This work introduces Neural Collapse-inspired Knowledge Distillation ($\mathcal{NCKD}$), which explicitly transfers the teacher’s NC structure by enforcing $\mathcal{NC}_\mathbf{1}$ prototype alignment, $\mathcal{NC}_\mathbf{2}$ ETF transfer, and an $\mathcal{NC}_\mathbf{3}$-inspired classifier, achieving improved generalization. The method provides three concrete losses and a plug-in NC3 classifier, and demonstrates strong, consistent gains across CIFAR-100, ImageNet-1k, and MS-COCO, often outperforming state-of-the-art KD baselines. This work highlights the practical value of embedding NC geometry into distillation, offering a robust framework that enhances both accuracy and efficiency in student models and suggesting avenues for future mutual distillation and data-driven teacher selection.

Abstract

Existing knowledge distillation (KD) methods have shown that they can bring student network performance on par with that of their teachers. However, the knowledge gap between the teacher and student remains significant and may hinder the effectiveness of the distillation process. In this work, we introduce the structure of Neural Collapse (NC) into the KD framework. NC typically occurs in the final phase of training, resulting in a graceful geometric structure where the last-layer features form a simplex equiangular tight frame. This phenomenon has been shown to improve the generalization of deep networks. We hypothesize that NC can also alleviate the knowledge gap in distillation, thereby enhancing student performance. This paper begins with an empirical analysis that connects knowledge distillation and neural collapse. Through this analysis, we establish that transferring the teacher's NC structure to the student benefits the distillation process. Therefore, instead of merely transferring instance-level logits or features, as existing distillation methods do, we encourage students to learn the teacher's NC structure. To this end, we propose a new distillation paradigm termed Neural Collapse-inspired Knowledge Distillation (NCKD). Comprehensive experiments demonstrate that NCKD is simple yet effective, improving the generalization of all distilled student models and achieving state-of-the-art accuracy.
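
As an illustration of the simplex equiangular tight frame (ETF) mentioned in the abstract, the following minimal sketch builds one with the standard textbook construction $\sqrt{K/(K-1)}\,(I - \tfrac{1}{K}\mathbf{1}\mathbf{1}^\top)$ and checks its defining property: every vertex has unit norm and every pair of distinct vertices has cosine $-1/(K-1)$. The function names `simplex_etf` and `cosine` are hypothetical helpers chosen for this example, and nothing here is taken from the authors' code.

```python
import math

def simplex_etf(num_classes):
    """Return K unit vectors in R^K that form a simplex ETF (one list per class).

    Illustrative construction only: vertex i is row i of
    sqrt(K / (K - 1)) * (I - (1 / K) * ones), which is symmetric.
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Example: with K = 4 classes, every pair of vertices has cosine -1/3.
etf = simplex_etf(4)
pairwise = [cosine(etf[i], etf[j]) for i in range(4) for j in range(i + 1, 4)]
norms = [math.sqrt(sum(x * x for x in v)) for v in etf]
```

The equal pairwise angle is what makes the structure "equiangular": no two classes are closer to each other than any other pair, which is the maximally separated arrangement of K directions.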

Paper Structure

This paper contains 30 sections, 12 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Description of the structure of Neural Collapse. All class features progressively collapse toward their centroids, forming an elegant equiangular structure. In addition, each classifier $\Vec{w}$ aligns with its corresponding normalized last-layer centroid $\Tilde{\Vec{h}}$.
  • Figure 2: Comparison of NC metrics and prediction performance across different methods. Both networks were distilled from ResNet32x4 on CIFAR-100. The ideal NC results are characterized by $\mathcal{NC}_\mathbf{1,2}$ approaching 0, and $\mathcal{NC}_\mathbf{3}$ approaching 1.
  • Figure 3: The overall framework of our NCKD. We distill $\mathcal{NC}_\mathbf{1,2}$ from the teacher to the student. We normalize the within-class mean $\boldsymbol{h}$ to $\Tilde{\boldsymbol{h}}$ to construct the ETF structure. We illustrate $\mathcal{NC}_\mathbf{2}$ distillation using $\Tilde{\boldsymbol{h}}^S_2$ as the example, which replicates the teacher's ETF structure with the other classes. The $\mathcal{NC}_\mathbf{3}$-inspired classifier is leveraged to reduce computational costs.
  • Figure 4: t-SNE of features learned by several KD methods. We use ResNet-32$\times$4/ResNet-8$\times$4 as the teacher/student pair.
  • Figure 5: Distillation results with standard and $\mathcal{NC}_\mathbf{3}$-inspired classifiers on CIFAR-100, with training time per epoch shown in the right table.
  • ...and 2 more figures
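
The captions above refer to normalized class centroids, an ETF structure among them, and classifier-centroid alignment. The sketch below shows one plausible way to measure the last two on a set of features. It is an assumption-laden illustration, not the paper's metric definitions: `class_centroids`, `etf_deviation`, and `classifier_alignment` are hypothetical names, the class means are centered by the global mean following the usual Neural Collapse convention, and the two scores are simple stand-ins chosen so that the ideal values are 0 and 1 respectively.

```python
import math

def _normalize(v):
    """Scale a vector (list of floats) to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def class_centroids(features, labels, num_classes):
    """Globally centered, unit-normalized class means.

    Assumes every class index in range(num_classes) appears at least once.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for f, y in zip(features, labels):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += f[d]
    global_mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    return [
        _normalize([sums[c][d] / counts[c] - global_mean[d] for d in range(dim)])
        for c in range(num_classes)
    ]

def etf_deviation(centroids):
    """Mean absolute gap between pairwise cosines and -1/(K-1); 0 for an ideal ETF."""
    k = len(centroids)
    target = -1.0 / (k - 1)
    gaps = [
        abs(_dot(centroids[i], centroids[j]) - target)
        for i in range(k) for j in range(i + 1, k)
    ]
    return sum(gaps) / len(gaps)

def classifier_alignment(weights, centroids):
    """Mean cosine between each classifier row and its class centroid; 1 when aligned."""
    cosines = [_dot(_normalize(w), c) for w, c in zip(weights, centroids)]
    return sum(cosines) / len(cosines)

# Toy example: 3 fully collapsed classes in 2-D, placed 120 degrees apart.
angles = [0.0, 2.0 * math.pi / 3.0, 4.0 * math.pi / 3.0]
points = [[math.cos(a), math.sin(a)] for a in angles]
features = [points[c] for c in range(3) for _ in range(2)]  # two samples per class
labels = [c for c in range(3) for _ in range(2)]
centroids = class_centroids(features, labels, 3)
weights = [[2.0 * x for x in c] for c in centroids]  # classifier parallel to centroids
deviation = etf_deviation(centroids)
alignment = classifier_alignment(weights, centroids)
```

In this toy setup the centroids already form a simplex ETF and the classifier rows are parallel to them, so `deviation` is numerically zero and `alignment` is numerically one. Real features would land somewhere short of both ideals.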

Theorems & Definitions (1)

  • proof