SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models
Jahyun Koo, Yerin Hwang, Yongil Kim, Taegwan Kang, Hyunkyung Bae, Kyomin Jung
TL;DR
The paper addresses knowledge distillation (KD) for autoregressive LLMs, where training on student-generated outputs (SGOs) can lead to teacher misguidance, especially in long sequences and when the teacher–student capacity gap is large. It introduces SWITCH, which uses the Jensen-Shannon divergence between the teacher and student next-token distributions to selectively invoke the teacher during sequence generation. The threshold decays exponentially, $\tau_t = \tau_0 e^{-\lambda t}$, which increases teacher involvement as generation proceeds. Experiments across the GPT-2, OPT, and OpenLLaMA-2 families on five instruction-following datasets show SWITCH achieving state-of-the-art performance, with the most pronounced gains when the student is substantially smaller than the teacher and when outputs are long. Ablations show that the exponential-decay schedule outperforms linear and constant schedules as well as other intervention strategies. Overall, SWITCH is a practical, robust approach to KD for autoregressive LLMs: it mitigates teacher misguidance while preserving the benefits of learning from student-generated sequences in long-sequence generation.
Abstract
Despite the success of Large Language Models (LLMs), they still face challenges related to high inference costs and memory requirements. To address these issues, Knowledge Distillation (KD) has emerged as a popular method for model compression. Using student-generated outputs (SGOs) as training data is particularly notable because it reduces the mismatch between training and inference. However, SGOs are often noisy and biased, which can lead to misguidance from the teacher model, especially in long sequences. To mitigate these challenges, we propose SWITCH (Studying WIth TeaCHer for Knowledge Distillation), a novel approach that strategically incorporates the teacher model during the student's sequence generation. SWITCH identifies discrepancies between the token probabilities of the teacher and student models, allowing the teacher to intervene selectively, particularly in long sequences that are more prone to teacher misguidance. Extensive experimental results across three model families and five instruction-following datasets show that SWITCH surpasses traditional KD methods, particularly in generating long sequences.
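The selective-intervention idea can be illustrated with a minimal sketch. It assumes that the teacher is invoked at step $t$ whenever the Jensen-Shannon divergence between the two next-token distributions exceeds the decaying threshold $\tau_t = \tau_0 e^{-\lambda t}$; the exact decision rule, the function names, and the default values of `tau0` and `lam` below are illustrative assumptions, not the authors' implementation.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def threshold(t, tau0, lam):
    """Exponentially decaying threshold tau_t = tau0 * exp(-lam * t)."""
    return tau0 * math.exp(-lam * t)


def teacher_intervenes(p_teacher, p_student, t, tau0=0.5, lam=0.05):
    """Assumed rule: invoke the teacher when the divergence exceeds tau_t.

    tau0 and lam are hypothetical values chosen for illustration only.
    """
    return js_divergence(p_teacher, p_student) > threshold(t, tau0, lam)


# Toy next-token distributions over a three-token vocabulary.
p_teacher = [0.7, 0.2, 0.1]
p_student = [0.5, 0.3, 0.2]

early = teacher_intervenes(p_teacher, p_student, t=0)
late = teacher_intervenes(p_teacher, p_student, t=100)
```

Under these assumed settings the same mild disagreement does not trigger the teacher at the first position but does trigger it at position 100, because the threshold has decayed. This is the sense in which a decaying threshold raises teacher involvement later in the sequence.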
