Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation
Yuxin Ren, Zihan Zhong, Xingjian Shi, Yi Zhu, Chun Yuan, Mu Li
TL;DR
This work tackles the paradox that a higher-quality teacher does not always yield a better student in knowledge distillation (KD). It introduces distillation influence, a per-sample metric that estimates how much distilling on a given training example will improve the student's validation performance, and builds Learning Good Teacher Matters (LGTM), which uses this metric to reweight training samples during teacher updates. The method uses a finite difference approximation to compute the influence efficiently and adds a teacher auxiliary loss to balance the teacher's own learning (self-evolution) with knowledge transfer to the student, enabling more personalized guidance. Empirical results on GLUE show that LGTM outperforms 10 KD baselines across 6 text classification tasks, with analyses demonstrating meaningful sample-level weighting and improved generalization. Overall, LGTM provides a principled, scalable way to adapt teacher training to the student's learning progress, enhancing knowledge transfer in NLP distillation setups.
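To make the finite difference idea concrete, the following is a minimal sketch on a toy linear model, not the paper's formulation (which involves both teacher and student). It assumes the per-sample score is the dot product between a sample's loss gradient and the validation gradient, and shows that a central difference of the loss along the validation gradient recovers that score without forming per-sample gradients. All function names and data are hypothetical.

```python
# Illustrative sketch only: a toy linear model with squared-error losses.
# Names such as sample_loss and finite_difference_alignment are hypothetical.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sample_loss(theta, x, y):
    """Squared error of a linear model on one example."""
    return 0.5 * (dot(theta, x) - y) ** 2

def sample_grad(theta, x, y):
    """Analytic gradient of sample_loss with respect to theta."""
    residual = dot(theta, x) - y
    return [residual * xi for xi in x]

def mean_grad(theta, data):
    """Average gradient over a dataset (here: the validation set)."""
    grads = [sample_grad(theta, x, y) for x, y in data]
    return [sum(col) / len(data) for col in zip(*grads)]

def exact_alignment(theta, example, val_grad):
    """Dot product of one sample's gradient with the validation gradient.
    A positive value means a gradient step on this sample lowers the
    validation loss to first order."""
    x, y = example
    return dot(sample_grad(theta, x, y), val_grad)

def finite_difference_alignment(theta, example, val_grad, eps=1e-3):
    """Central-difference estimate of the same quantity: the directional
    derivative of the sample loss along val_grad, from two loss evaluations."""
    x, y = example
    plus = [t + eps * g for t, g in zip(theta, val_grad)]
    minus = [t - eps * g for t, g in zip(theta, val_grad)]
    return (sample_loss(plus, x, y) - sample_loss(minus, x, y)) / (2 * eps)

# Made-up parameters and data, for illustration only.
theta = [0.5, -1.0]
train = [([1.0, 2.0], 1.0), ([0.0, 1.0], -2.0), ([3.0, -1.0], 0.5)]
val = [([1.0, 1.0], 0.0), ([2.0, 0.0], 1.5)]

val_grad = mean_grad(theta, val)
exact = [exact_alignment(theta, ex, val_grad) for ex in train]
approx = [finite_difference_alignment(theta, ex, val_grad) for ex in train]
```

In this toy setting the two lists agree up to floating-point error because the loss is quadratic. For a neural network the central difference is only an approximation whose accuracy depends on the choice of eps.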
Abstract
It has been commonly observed that a teacher model with superior performance does not necessarily produce a stronger student, highlighting a discrepancy between current teacher training practices and effective knowledge transfer. To better guide the teacher training process, we introduce the concept of distillation influence, which measures how distillation from each training sample affects the student's generalization ability. We propose Learning Good Teacher Matters (LGTM), an efficient training technique that incorporates distillation influence into the teacher's learning process. By prioritizing samples that are likely to enhance the student's generalization ability, LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.
