LoCa: Logit Calibration for Knowledge Distillation

Runming Yang; Taiqiang Wu; Yujiu Yang

TL;DR

The paper identifies a mis-instruction problem in logit-based knowledge distillation: when the teacher's top prediction disagrees with the ground-truth label, aligning the student to the teacher's logits misleads it. To address this, the paper proposes LoCa, a parameter-free logit calibration technique. LoCa scales down the non-target logits by a factor $s$ and recomputes the target logit so that the ground-truth class becomes the top prediction, while preserving the relative proportions among the non-target logits to retain useful dark knowledge. The scaling factor is $s=\alpha\cdot\sigma$, with $\sigma=\frac{1}{1-p_{gt}+p_{k_{logits}}}$. LoCa improves consistently over vanilla KD on image classification (CIFAR-100, ImageNet) and text generation (Dolly, S-NI, UnNI) across diverse teacher-student pairs, with minimal additional computation. The method also complements DKD, yielding further gains, and case studies show that LoCa reduces hallucinations and grammatical errors in generated text. Overall, LoCa is a low-cost enhancement to logit-based distillation that applies to both vision and language tasks.
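The calibration step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the calibration acts on softmax probabilities, that $p_{k_{logits}}$ denotes the largest non-target probability, and that $0<\alpha<1$. The function name `loca_calibrate` and the default `alpha=0.9` are hypothetical.

```python
def loca_calibrate(probs, label, alpha=0.9):
    """Calibrate one teacher probability vector so that `label` ranks first.

    Assumptions (not confirmed by the source): `probs` are softmax
    probabilities, p_k is the largest non-target probability, 0 < alpha < 1.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")

    top = max(range(len(probs)), key=probs.__getitem__)
    if top == label:
        return list(probs)  # teacher already agrees with the label: no change

    p_gt = probs[label]                                      # true-class probability
    p_k = max(p for i, p in enumerate(probs) if i != label)  # largest non-target

    sigma = 1.0 / (1.0 - p_gt + p_k)
    s = alpha * sigma                                        # scaling factor s = alpha * sigma

    calibrated = [s * p for p in probs]                      # shrink all entries by s
    calibrated[label] = 1.0 - s * (1.0 - p_gt)               # target takes the remaining mass
    return calibrated


# Toy example: the teacher prefers class 0, but the label is class 1.
teacher_probs = [0.6, 0.3, 0.1]
calibrated = loca_calibrate(teacher_probs, label=1, alpha=0.9)
```

Under these assumptions the output still sums to 1, the non-target entries keep their ratio (here 6:1), and the target entry exceeds the largest non-target entry exactly when $\alpha<1$, because $1 - s(1-p_{gt}) > s\,p_k$ is equivalent to $s < \sigma$.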

Abstract

Knowledge Distillation (KD), which trains a stronger student model by having it mimic a teacher model, plays an important role in model compression. One typical approach is to align the output logits of the two models. However, we identify a common issue, which we call mis-instruction: the student is misled when the predictions derived from the teacher's logits do not match the labels. At the same time, the logits carry other useful dark knowledge, such as class discriminability, which is vital for distillation. In this paper, we propose a simple yet effective Logit Calibration (LoCa) method, which calibrates the teacher's logits based on the ground-truth labels. The key insight is to correct the prediction (addressing the mis-instruction issue) while preserving the useful dark knowledge. LoCa does not require any additional parameters. Empirical results on image classification and text generation tasks demonstrate that LoCa effectively improves the performance of baselines.
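To make the mis-instruction issue concrete, the sketch below uses a common formulation of logit alignment: the KL divergence between the teacher's and the student's softmax distributions. The exact loss, the `temperature` parameter, and the function names are illustrative assumptions, not details taken from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over the softened class distributions."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy sample whose label is class 1, while the teacher prefers class 0.
teacher = [2.0, 1.0, 0.1]
student_copying_teacher = [2.0, 1.0, 0.1]   # predicts class 0 (wrong)
student_following_label = [1.0, 2.0, 0.1]   # predicts class 1 (right)

loss_copy = kd_loss(teacher, student_copying_teacher)
loss_label = kd_loss(teacher, student_following_label)
```

In this toy case the alignment loss is lowest for the student that reproduces the teacher's wrong prediction, so the distillation signal pulls the student away from the label. This is the situation that calibrating the teacher's logits is meant to avoid.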

Paper Structure (26 sections, 18 equations, 4 figures, 10 tables)

Preliminary
  Logit-based KD
  Analysis on Mis-instruction Issue
Methodology
  Optimization Objective
  LoCa: Logit Calibration
Experiments
  Image Classification Tasks
    Datasets and Models
    Baselines and Implementation
    Main Results
  Text Generation Tasks
    Datasets and Models
    Baselines and Implementation
    Main Results
...and 11 more sections

Figures (4)

Figure 1: Mis-instruction ratio of various teacher models on the ImageNet training set. For all models, the ratio exceeds 17.5%.
Figure 2: The details of the proposed LoCa method. During distillation, we first calibrate the logits for the mis-instruction samples and then employ the calibrated logits for KD. Specifically, we introduce a scaling factor $\alpha$ to decrease the non-target logits and increase the target logit.
Figure 3: Rouge-L scores on the Dolly, S-NI, and UnNI datasets. We report the mean and standard deviation over 5 trials. Our proposed LoCa outperforms KD on all benchmarks.
Figure 4: Ablation studies of LoCa under different $\alpha$ settings. Part (a) uses ResNet32$\times$4 as the teacher and ResNet8$\times$4 as the student on CIFAR-100. Part (b) uses ResNet-34 as the teacher and ResNet-18 as the student on ImageNet.
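
The mis-instruction ratio reported in Figure 1 can be read as the fraction of training samples on which the teacher's top prediction differs from the label. The sketch below computes that fraction under this reading; the function name and the toy data are hypothetical and are not taken from the paper.

```python
def mis_instruction_ratio(teacher_logits, labels):
    """Fraction of samples whose teacher argmax differs from the label."""
    if len(teacher_logits) != len(labels):
        raise ValueError("teacher_logits and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")

    wrong = sum(
        1
        for logits, y in zip(teacher_logits, labels)
        if max(range(len(logits)), key=logits.__getitem__) != y
    )
    return wrong / len(labels)


# Invented toy batch: the teacher disagrees with the label on the last two samples.
toy_logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 5.0], [4.0, 0.0, 1.0]]
toy_labels = [0, 1, 1, 2]
ratio = mis_instruction_ratio(toy_logits, toy_labels)
```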
