Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation

Nicholas Cooper; Lijun Chen; Sailesh Dwivedy; Danna Gurari

TL;DR

The paper addresses the limitations of logit-based supervision in feature knowledge distillation (FKD) by proposing a logit-free FKD framework that trains the student backbone exclusively with feature-based losses. It introduces a geometry-inspired knowledge quality metric $\mathcal{Q}$, which combines separation, information, and efficiency, to automatically select the most informative teacher layers for distillation. Across CNNs and ViTs on CIFAR-10/100 and Tiny ImageNet, the method improves top-1 accuracy by up to 15% over standard approaches, indicating that removing logit losses and selecting layers with $\mathcal{Q}$ yields state-of-the-art KD performance. The results highlight the importance of latent geometry in knowledge transfer and suggest implications for layer selection and training regimes beyond KD.

Abstract

Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a lightweight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student's backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate that our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches. We publicly share our code to facilitate future work at https://github.com/Thegolfingocto/KD_wo_CE.
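
To make the distinction in the abstract concrete, the following is a minimal sketch (not the authors' implementation) contrasting a conventional feature KD objective, which adds a logit-based cross-entropy term, with a logit-free objective that uses feature terms only. The choice of mean squared error as the feature loss, the function names, and the `logit_weight` parameter are all assumptions made for illustration; the abstract does not specify the feature loss, and the student and teacher features are assumed to already have matching dimensions.

```python
import math


def mse_feature_loss(student_feat, teacher_feat):
    """Mean squared error between two equal-length feature vectors.

    MSE is an illustrative assumption, not the loss named by the paper.
    """
    if len(student_feat) != len(teacher_feat):
        raise ValueError("student and teacher features must have equal length")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def cross_entropy(logits, label):
    """Cross entropy of a single example, computed from pre-softmax logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def backbone_loss(student_feats, teacher_feats, logits=None, label=None,
                  logit_weight=0.0):
    """Average feature loss over the chosen layers, plus an optional logit term.

    logit_weight=0.0 (hypothetical parameter) gives the logit-free setting:
    the logits and label are never consulted.
    """
    layer_losses = [mse_feature_loss(s, t)
                    for s, t in zip(student_feats, teacher_feats)]
    feature_term = sum(layer_losses) / len(layer_losses)
    if logit_weight == 0.0:
        return feature_term
    return feature_term + logit_weight * cross_entropy(logits, label)


# One distilled layer with two-dimensional features.
student = [[1.0, 2.0]]
teacher = [[1.0, 4.0]]
logit_free = backbone_loss(student, teacher)
with_logits = backbone_loss(student, teacher, logits=[0.0, 0.0], label=0,
                            logit_weight=1.0)
```

In this sketch the logit-free objective depends only on how closely the student's features match the teacher's, so which teacher layers are distilled fully determines the training signal for the backbone.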
