Rethinking Self-Distillation: Label Averaging and Enhanced Soft Label Refinement with Partial Labels

Hyeonsu Jeong, Hye Won Chung

TL;DR

This work analyzes self-distillation for multi-class classification when a fixed feature extractor is used (linear probing). It demonstrates that multi-round distillation effectively performs label averaging among highly correlated instances, guided by the Gram matrix's eigenvectors, which leads to clustering of predictions and improved generalization in the presence of label noise. The authors derive conditions under which the distillation process can achieve 100% population accuracy and introduce a novel single-round Partial Label Learning (PLL) method that refines the teacher's top predictions into a two-label target set, replicating the benefits of multi-round distillation with far lower computational cost. Empirical results on several real and synthetic datasets corroborate the theory, showing PLL's strong performance in high-noise regimes and confirming the practicality of leveraging self-distillation in linear-probing setups with foundation-model features.
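The label-averaging effect can be illustrated with a small numerical sketch. It is not the authors' implementation: it assumes kernel ridge regression on fixed features as a stand-in for the paper's training objective, and the function name `distill_outputs`, the `ridge` parameter, and the toy Gram matrix (1 on the diagonal, 0.8 within a class, 0 across classes) are all hypothetical. Under this assumption each round maps the current training targets `Y` to `Y @ G @ inv(G + ridge * I)`, an operator that shares its eigenvectors with the Gram matrix `G`.

```python
import numpy as np

def distill_outputs(gram, labels_onehot, rounds, ridge=1.0):
    """Training-set outputs after `rounds` of ridge self-distillation.

    gram:          (N, N) Gram matrix of the fixed features.
    labels_onehot: (K, N) one-hot encoding of the given (possibly noisy) labels.
    Assumption: each round is a kernel ridge regression fit to the
    previous round's outputs; this is a stand-in, not the paper's loss.
    """
    n = gram.shape[0]
    # One round maps targets Y to Y @ smoother on the training inputs.
    smoother = gram @ np.linalg.inv(gram + ridge * np.eye(n))
    outputs = labels_onehot.astype(float)
    for _ in range(rounds):
        outputs = outputs @ smoother
    return outputs

# Hypothetical toy data: 2 classes with 4 instances each.
n_per_class, num_classes = 4, 2
true_labels = np.repeat(np.arange(num_classes), n_per_class)
same_class = true_labels[:, None] == true_labels[None, :]
gram = np.where(same_class, 0.8, 0.0)  # high correlation within a class
np.fill_diagonal(gram, 1.0)

given_labels = true_labels.copy()
given_labels[3] = 1  # corrupt the label of one class-0 instance
labels_onehot = np.eye(num_classes)[given_labels].T  # shape (K, N)

for rounds in (0, 1, 3):
    outputs = distill_outputs(gram, labels_onehot, rounds)
    print(rounds, outputs.argmax(axis=0))
```

In this toy setup the corrupted instance is pulled toward the majority label of its highly correlated block, and the outputs within a block move closer together as the number of rounds grows, which mirrors the clustering behavior described above.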

Abstract

We investigate the mechanisms of self-distillation in multi-class classification, particularly in the context of linear probing with fixed feature extractors where traditional feature learning explanations do not apply. Our theoretical analysis reveals that multi-round self-distillation effectively performs label averaging among instances with high feature correlations, governed by the eigenvectors of the Gram matrix derived from input features. This process leads to clustered predictions and improved generalization, mitigating the impact of label noise by reducing the model's reliance on potentially corrupted labels. We establish conditions under which multi-round self-distillation achieves 100% population accuracy despite label noise. Furthermore, we introduce a novel, efficient single-round self-distillation method using refined partial labels from the teacher's top two softmax outputs, referred to as the PLL student model. This approach replicates the benefits of multi-round distillation in a single round, achieving comparable or superior performance--especially in high-noise scenarios--while significantly reducing computational cost.
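A minimal sketch of how two-label targets might be built from a teacher's softmax outputs follows. The helper name `top_two_partial_labels` is hypothetical, and putting equal weight 1/2 on each of the two candidate labels is an assumption made for illustration; this summary does not state the exact target weighting used for the PLL student.

```python
import numpy as np

def top_two_partial_labels(teacher_probs):
    """Build two-label targets from teacher softmax outputs.

    teacher_probs: (N, K) array of per-class probabilities, K >= 2.
    Returns an (N, K) array with 0.5 on each of the teacher's top two
    classes and 0 elsewhere (equal weighting is an assumption).
    """
    n, k = teacher_probs.shape
    # After partitioning at position k - 2, the last two columns hold
    # the indices of the two largest entries of each row.
    top2 = np.argpartition(teacher_probs, k - 2, axis=1)[:, k - 2:]
    targets = np.zeros((n, k))
    np.put_along_axis(targets, top2, 0.5, axis=1)
    return targets

# Hypothetical teacher outputs for two instances and three classes.
teacher_probs = np.array([[0.1, 0.6, 0.3],
                          [0.5, 0.2, 0.3]])
targets = top_two_partial_labels(teacher_probs)
print(targets)
```

A student trained once on such targets keeps the teacher's second-ranked class in the candidate set, so a sample whose given label is corrupted can still carry weight on its true label when the teacher ranks that label second.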

Paper Structure

This paper contains 46 sections, 8 theorems, 128 equations, 18 figures, 56 tables, 1 algorithm.

Key Result

Theorem 2.1

As the number of self-distillation rounds $t \in \mathbb{Z}^{+}$ increases, the output ${\mathbf{Y}}^{(t)} = [{\mathbf{y}}_1^{(t)}, \dots, {\mathbf{y}}_{Kn}^{(t)}] \in \mathbb{R}^{K\times Kn}$ of the $t$-th distilled model for the inputs of the training dataset $\{\phi({\mathbf{x}}_i), \hat{y}_i\}_{i=1}^{Kn}$ [...] where ${\mathbf{Y}}^{(0)} = [{\mathbf{e}}({\hat{y}_1}), \cdots, {\mathbf{e}}({\hat{y}_{Kn}})]$ [...]

Figures (18)

  • Figure 1: The Gram matrix ${\bm{\Phi}}$ of 50,000 instances of CIFAR-100, extracted from a ResNet34 network pre-trained on ImageNet. Similar block-wise structures of the Gram matrix are observed for six different real datasets in Fig. \ref{fig:ft_corr}.
  • Figure 2: (a) The evolution of the eigenvalues of ${\bm{\Phi}}^{(t)}$ as the number of distillation rounds $t$ increases. Notably, only the top $K$ ($=4$) eigenvalues dominate the others as $t$ progresses. (b) A 2D plot showing the softmax outputs of the teacher model and the 4th and 6th student models for the training instances. Each corner represents the one-hot vector of a class. As the distillation round $t$ increases, instances form distinct clusters based on their true labels (represented by the colors of the dots), driven by the label averaging among highly correlated input instances within the same class. See Appendix \ref{app:output_full} for the full evolution.
  • Figure 3: (a) Prediction accuracy of the teacher and student models for different distillation rounds $t$ as the label corruption rate increases. The 100% population accuracy regime expands for student models with more distillation rounds. (b) Softmax outputs for $K=4$ classification as $t$ increases, using the same setup as Fig. \ref{fig:exp_synth} with a 50% label corruption rate. The BLUE line shows the average softmax value at the true label, the ORANGE line for noisy samples at the given label, and the GREEN line for other labels. Shaded regions show $\pm 1$ standard deviation. For clean samples, the true label maintains the highest value, while for noisy samples, the true label starts lower than the given label but surpasses it after a few rounds. The shaded region shrinks with $t$ due to label clustering.
  • Figure 4: Distillation gain in accuracy (%) by each student model compared to the teacher in the superclass label corruption scenario. PLAIN represents the improvement of the multi-round students over the teacher model, while ORANGE represents the improvement by the PLL student model.
  • Figure 5: (a) Visualization of the softmax outputs of the teacher model, and the first and fourth student models on the CIFAR-100 dataset, reduced to four-class classification outputs for clarity. (b) Accuracy improvement (%) for each PLL student model using top-$k$ partial labels compared to the teacher model in the superclass label corruption scenario on the CIFAR-100 dataset.
  • ...and 13 more figures

Theorems & Definitions (13)

  • Theorem 2.1: Informal version of Thm. \ref{thm:Q1}
  • Theorem 2.2: Informal version of Thm. \ref{thm:self}
  • Theorem 2.3: Informal version of Thm. \ref{thm:pll}
  • Theorem 3.1
  • Theorem 4.1
  • Proof: Proof Sketch of Theorem \ref{thm:self}
  • Lemma 4.1
  • Theorem 5.1
  • Proof: Proof Sketch
  • Remark 1: Comparison with Multi-Round Self-Distillation
  • ...and 3 more