Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation

Nicholas Cooper, Lijun Chen, Sailesh Dwivedy, Danna Gurari

TL;DR

The paper tackles the limitations of logit-based supervision in feature knowledge distillation by proposing a logit-free FKD framework that trains the student backbone exclusively via feature-based losses. It introduces a geometry-inspired knowledge quality metric $\mathcal{Q}$, combining separation, information, and efficiency, to automatically select the most informative teacher layers and improve distillation. Across CNNs and ViTs on CIFAR-10/100 and Tiny ImageNet, the method achieves up to 15% top-1 accuracy gains and demonstrates that removing logit losses coupled with $\mathcal{Q}$-driven layer selection yields robust, state-of-the-art KD performance. The approach highlights the importance of latent geometry in knowledge transfer and suggests broader implications for layer selection and training regimes beyond KD.

Abstract

Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student's backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate that our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches. We publicly share our code to facilitate future work at https://github.com/Thegolfingocto/KD_wo_CE.
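The gradient routing described in the abstract (feature losses train the backbone, cross entropy trains only the classifier) can be illustrated with a deliberately tiny sketch. Everything below is an assumption made for illustration and is not taken from the authors' code: the student backbone is a single scalar weight `w`, the classifier is a single scalar weight `v`, the feature loss is a squared error against a teacher feature, and the function name `logit_free_grads` is hypothetical. The point is only that the label influences the classifier gradient and never the backbone gradient.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logit_free_grads(w: float, v: float, x: float, teacher_feat: float, y: int):
    """Return (grad_w, grad_v) for one sample of a scalar toy student.

    Assumed toy setup (not the paper's architecture or losses):
      feature  = w * x                      (student backbone)
      logit    = v * stop_gradient(feature) (classifier)
      L_feat   = (feature - teacher_feat)^2 (assumed MSE feature loss)
      L_ce     = binary cross entropy on sigmoid(logit)
    """
    feat = w * x
    # Feature loss is the only signal that reaches the backbone weight w.
    grad_w = 2.0 * (feat - teacher_feat) * x
    # Cross entropy reaches only the classifier weight v. The feature is
    # treated as a constant here, so no CE term is added to grad_w.
    logit = v * feat
    grad_v = (sigmoid(logit) - y) * feat
    return grad_w, grad_v


# Flipping the label changes the classifier gradient but not the backbone's.
gw0, gv0 = logit_free_grads(w=0.5, v=1.0, x=2.0, teacher_feat=4.0, y=0)
gw1, gv1 = logit_free_grads(w=0.5, v=1.0, x=2.0, teacher_feat=4.0, y=1)
```

In an autodiff framework the same effect would typically be obtained by detaching the backbone features before passing them to the classifier; the manual gradients above just make that routing explicit.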

Paper Structure

This paper contains 41 sections, 19 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Illustration of our feature knowledge distillation framework and its two key distinctions from prior work. First, while the status quo is to back-propagate logit-based losses through the student backbone ($\rightarrow$), our method back-propagates only CE through just the classifier ($\rightarrow$). Second, while the default strategy is to select teacher layers to distill from the end of each 'stage' ($\rightarrow$), we introduce a metric for automatically selecting the layers with the highest knowledge quality, which often occur within a single stage ($\rightarrow$).
  • Figure 2: Per-layer knowledge quality analysis of ResNet34 (left) and ViT_B (right) on CIFAR100. X-axes: layer indices. Y-axis: $\mathcal{S}$ (dark blue), $\mathcal{I}$ (light blue), $\mathcal{E}$ (red), $\mathcal{Q}$ (gray). Orange circles indicate standard layer selections and gray Xs indicate maximal knowledge quality layers.
  • Figure 3: Performance of proposed method and baselines. Vertical black lines denote baseline student performance and the end of each bar shows standard deviation values from three runs. Configurations which failed to converge are not plotted. ARI denotes the mean ARI from our method to all baselines.
  • Figure 4: Performance of different teacher layer selection methods when paired with three loss recipes: our loss recipe (orange), CE loss used in backbone (light blue), and both CE and KL loss used in backbone (dark blue). Configurations which failed to converge are clipped to $-3$ for improved legibility. Ours and standard layer selection are indicated by solid and striped bars, respectively.
  • Figure 5: Relationship between the intrinsic, embedding, and ambient dimensions (ID, ED, AD). The blue circle has ID $1$, because it is a 1-dimensional manifold. However, it has ED $2$ (dashed lines), because it cannot exist in $\mathbb{R}^n$ when $n < 2$. Yet, the circle is drawn in AD $3$. Generally, $\mathrm{ID} \leq \mathrm{ED} \leq \mathrm{AD}$.
  • ...and 11 more figures
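The circle example in the Figure 5 caption can be checked numerically with a small sketch. The code below is an illustration under stated assumptions, not the paper's estimator: it samples a circle lying in a tilted plane of $\mathbb{R}^3$ and computes the rank of the mean-centered points with a hand-written Gaussian elimination. The helper name `affine_rank`, the tilt angle, and the tolerance are all hypothetical choices. For this planar circle the rank coincides with the ED of 2; in general the dimension of the affine span is only an upper bound on ED. The ID of 1 is not estimated here; it holds by construction, since every point is generated from the single parameter `theta`.

```python
import math


def affine_rank(points, tol=1e-9):
    """Rank of the mean-centered point matrix via Gaussian elimination."""
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    rows = [[p[i] - mean[i] for i in range(dim)] for p in points]
    rank = 0
    for col in range(dim):
        # Choose the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, len(rows)),
                    key=lambda r: abs(rows[r][col]), default=None)
        if pivot is None or abs(rows[pivot][col]) < tol:
            continue  # column adds no new direction
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


# A unit circle in a plane tilted by 0.3 rad about the x-axis (assumed values).
tilt = 0.3
circle = [
    (math.cos(theta),
     math.sin(theta) * math.cos(tilt),
     math.sin(theta) * math.sin(tilt))
    for theta in (2.0 * math.pi * k / 50 for k in range(50))
]

ambient_dim = len(circle[0])     # AD: number of coordinates per point
span_dim = affine_rank(circle)   # equals ED for this planar circle
```

With these points, `ambient_dim` is 3 and `span_dim` is 2, which together with the single generating parameter matches the ordering ID, ED, AD of 1, 2, 3 given in the caption.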