Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation

Yuxin Ren; Zihan Zhong; Xingjian Shi; Yi Zhu; Chun Yuan; Mu Li

TL;DR

This work tackles the paradox that a higher-quality teacher does not always yield a better student in knowledge distillation. It introduces distillation influence, a per-sample metric that predicts how much distillation on a training example will improve validation performance, and builds Learning Good Teacher Matters (LGTM) to reweight training samples during teacher updates. The method uses a finite difference approximation to efficiently compute influence and adds a teacher auxiliary loss to balance self-evolution with transfer, enabling more personalized guidance. Empirical results on GLUE show LGTM outperforms 10 KD baselines across 6 text classification tasks, with analyses demonstrating meaningful sample-level weighting and improved generalization. Overall, LGTM provides a principled, scalable way to adapt teacher training to the student's learning progress, enhancing knowledge transfer in NLP KD setups.
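The summary above does not give the paper's formulas, so the following is only a generic sketch of the idea behind a finite difference approximation, not the authors' implementation. It assumes that the usefulness of a training sample can be scored by the inner product between that sample's gradient and the validation-loss gradient (a positive score means a small gradient step on the sample lowers validation loss), and it shows that this inner product can be estimated from two evaluations of the validation loss without forming the validation gradient. All names here (`influence_exact`, `influence_finite_difference`, the toy quadratic loss) are hypothetical and chosen for illustration.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def shifted(theta: Vector, direction: Vector, scale: float) -> Vector:
    """Return theta + scale * direction."""
    return [t + scale * d for t, d in zip(theta, direction)]


def influence_exact(grad_val: Vector, grad_sample: Vector) -> float:
    """Assumed influence score: alignment of sample and validation gradients."""
    return dot(grad_val, grad_sample)


def influence_finite_difference(
    val_loss: Callable[[Vector], float],
    theta: Vector,
    grad_sample: Vector,
    eps: float = 1e-3,
) -> float:
    """Central difference estimate of grad_sample . grad(val_loss)(theta).

    Only two evaluations of val_loss are needed; the validation gradient
    itself is never computed.
    """
    plus = val_loss(shifted(theta, grad_sample, eps))
    minus = val_loss(shifted(theta, grad_sample, -eps))
    return (plus - minus) / (2.0 * eps)


# Toy setup (hypothetical): a quadratic validation loss with a known gradient.
TARGET: Vector = [1.0, -2.0]


def toy_val_loss(theta: Vector) -> float:
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, TARGET))


def toy_val_grad(theta: Vector) -> Vector:
    return [t - c for t, c in zip(theta, TARGET)]


theta: Vector = [0.0, 0.0]
# Made-up per-sample gradients standing in for three training examples.
sample_grads: List[Vector] = [[-1.0, 2.0], [1.0, 0.0], [0.5, -0.5]]

for g in sample_grads:
    exact = influence_exact(toy_val_grad(theta), g)
    approx = influence_finite_difference(toy_val_loss, theta, g)
    print(f"exact={exact:+.4f}  finite_difference={approx:+.4f}")
```

In this toy setting the first sample's gradient is aligned with the validation gradient and receives a positive score, while the other two receive negative scores. A reweighting scheme could use such scores to emphasize or down-weight samples, which is the role the summary attributes to distillation influence.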

Abstract

It has been commonly observed that a teacher model with superior performance does not necessarily result in a stronger student, highlighting a discrepancy between current teacher training practices and effective knowledge transfer. In order to enhance the guidance of the teacher training process, we introduce the concept of distillation influence to determine the impact of distillation from each training sample on the student's generalization ability. In this paper, we propose Learning Good Teacher Matters (LGTM), an efficient training technique for incorporating distillation influence into the teacher's learning process. By prioritizing samples that are likely to enhance the student's generalization ability, our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.

Paper Structure (35 sections, 20 equations, 6 figures, 7 tables, 1 algorithm)

Introduction
Notations
Revisiting Learning to Teach
Vanilla distillation
Online distillation
Meta distillation
Methods
Distillation influence
Finite difference approximation
Teacher’s auxiliary loss
Relationship with other L2T methods
Experiments
Experimental Setup
Datasets
Baselines
...and 20 more sections

Figures (6)

Figure 1: Comparison of vanilla distillation, online distillation, meta distillation and our proposed LGTM. The dotted orange lines show the direction of the gradient flow for model update. Note that vanilla distillation and meta distillation employ a two-stage training pipeline by first fine-tuning the teacher on the target task. Online distillation and LGTM employ a one-stage joint training strategy for both teacher and student.
Figure 2: Performance comparison between Meta Distill (zhou2022bert) and LGTM on the MNLI validation set. For LGTM, the student model does not suffer from overfitting (thanks to distillation influence), and the teacher balances its own evolution against effective knowledge transfer (thanks to the auxiliary loss).
Figure 3: We select two samples from the MRPC dataset and visualize how their distillation influence evolves during training, together with its relationship to the predictions of the student and the teacher. Left: our method assigns a negative weight to a potentially difficult sample, which helps avoid overfitting. Right: our method assigns a positive weight to a potentially easy sample, which encourages model learning.
Figure 4: We visualize the trend of the distillation influence for 64 random samples from the MRPC dataset. Whether a sample receives a positive or a negative weight, the trend is similar: distillation influence is usually insignificant at the beginning and end of training but fluctuates in the middle. We hypothesize that this is because our method assigns varying weights to each sample during training, with the goal of filtering out difficult samples and focusing on samples that are better for generalization.
Figure 5: The entropy gap between the teacher and the student on the SST-2 training set for two-stage and one-stage training strategies. We keep only the loss with respect to ground-truth labels in \ref{eqn:teacher} to train the teacher. We follow shi2020learning and initialize both the teacher's and the student's classifiers to zero in the one-stage setting.
...and 1 more figures
