Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Wenda Xu, Rujun Han, Zifeng Wang, Long T. Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister

TL;DR

Speculative Knowledge Distillation (SKD) introduces interleaved teacher–student sampling to address the knowledge gap between large teachers and smaller students in KD. By checking student-proposed tokens against the teacher’s top-$K$ tokens and adaptively transitioning from supervised-like to on-policy-like training, SKD mitigates low-quality samples and distribution mismatch. The approach yields consistent improvements across translation, dialogue summarization, arithmetic reasoning, and math instruction following, across different initializations and data regimes, while enabling faster speculative-decoding-based inference. This end-to-end KD framework offers robust performance, practical speedups, and broad applicability to task-specific and task-agnostic distillation scenarios in language-model settings.

Abstract

Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods, such as supervised KD and on-policy KD, are adversely impacted by the knowledge gap between teacher and student in practical scenarios. Supervised KD suffers from a distribution mismatch between training on a static dataset and inference over student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples with which teacher models are not familiar, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on the fly while aligning with the student's inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.
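The interleaved sampling described in the abstract can be illustrated with a minimal, token-level sketch. It assumes that "poorly ranked" means the proposed token falls outside the teacher's top-$K$ next-token candidates, and that a rejected token is resampled from the teacher's distribution. The names `skd_sample`, `toy_student`, and `toy_teacher` are hypothetical, and the toy models are fixed probability tables rather than language models, so this is an illustration of the idea and not the authors' implementation.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A "model" here is any function mapping a token prefix to next-token probabilities.
Dist = Callable[[Sequence[int]], List[float]]


def top_k_ids(probs: List[float], k: int) -> set:
    """Indices of the k most probable tokens (ties broken by lower index)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def sample(probs: List[float], rng: random.Random) -> int:
    """Draw one token id from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def skd_sample(student: Dist, teacher: Dist, prompt: Sequence[int],
               max_new_tokens: int, k: int,
               rng: random.Random) -> Tuple[List[int], List[bool]]:
    """Student proposes each token; the teacher replaces it if it is outside its top-k."""
    tokens = list(prompt)
    replaced = []
    for _ in range(max_new_tokens):
        proposal = sample(student(tokens), rng)
        teacher_probs = teacher(tokens)
        if proposal in top_k_ids(teacher_probs, k):
            tokens.append(proposal)  # accepted: stays on the student's distribution
            replaced.append(False)
        else:
            tokens.append(sample(teacher_probs, rng))  # rejected: teacher resamples
            replaced.append(True)
    return tokens, replaced


# Hypothetical toy models over a 4-token vocabulary (assumptions for illustration).
def toy_student(prefix: Sequence[int]) -> List[float]:
    return [0.25, 0.25, 0.25, 0.25]


def toy_teacher(prefix: Sequence[int]) -> List[float]:
    return [0.7, 0.2, 0.05, 0.05]
```

In this sketch, setting `k` to the vocabulary size accepts every student token (on-policy-like sampling), while `k = 0` replaces every token with a teacher sample (supervised-like sampling), mirroring the two limiting cases the summary describes. The resulting sequence would then be used as training data for the distillation loss, which is not shown here.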

Paper Structure

This paper contains 48 sections, 3 equations, 11 figures, 16 tables, 1 algorithm.

Figures (11)

  • Figure 1: SKD outperforms supervised and on-policy KD on our tested tasks: Assamese-to-English translation, dialogue summarization, and arithmetic reasoning. As teacher models, we employ supervised FT Gemma-7B-it and Qwen-7B-it ("it" means instruction-tuned); as student models, we use Gemma-2B and Qwen-0.5B (either instruction-tuned or supervised FT checkpoints, depending on which performs best). Supervised KD is trained on ground-truth outputs, while on-policy KD uses self-generated data. All models use greedy decoding for evaluation.
  • Figure 2: Overview of Speculative Knowledge Distillation (SKD) on an arithmetic reasoning task. Left: SKD addresses the limitations of on-policy knowledge distillation (KD) by filtering out low-quality student samples and replacing them with teacher-generated tokens. Right: illustration of how SKD generalizes to both supervised KD (replacing all tokens with teacher tokens) and on-policy KD (accepting all student tokens).
  • Figure 3: SKD outperforms baseline KD methods on four held-out test sets (Math, GSM$_{plus}$, SVAMP, and ASDiv). We consider two data-size settings, 1k and 10k, and report test accuracy in the first row. In the second row, we calculate the performance gain of SKD over SFT and report the relative performance gains of the baselines compared to SKD. As SKD outperforms all baselines in most cases, these values typically range from 0 to 1. In some instances, supervised KD may be outperformed by SFT, resulting in negative values. The middle panel of the second row shows that SKD outperforms all baselines on $7$ math concepts in the Math test set. SKD is performed with $K=25$.
  • Figure 4: Comparison between SKD and baseline KD methods with different model initializations. We find that on-policy methods struggle when the student model starts from a poor initialization: performance degrades and remains stuck at a low level throughout training. On-policy KD therefore requires the student model to have a good initialization. In contrast, SKD outperforms supervised and on-policy KD under both IT and supervised FT Gemma-2B initializations. Low-quality samples from on-policy KD can be found in Appendix \ref{sec:bad_case_study}. SKD is performed with $K=25$.
  • Figure 5: Ablation study. We train the student model with supervised KD for the first half of training and with self-generated samples for the second half. Our results demonstrate that this mixed training strategy is mostly outperformed by either supervised KD or on-policy KD across three tasks and under two model initializations. Moreover, it is significantly outperformed by SKD. We exclude on-policy KD from the leftmost figure due to its extremely low COMET score.
  • ...and 6 more figures
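
The normalization described in the Figure 3 caption can be read as the baseline's gain over SFT divided by SKD's gain over SFT. The sketch below spells out that reading; the function name `relative_gain` and the accuracy values in the usage note are hypothetical and are not results from the paper.

```python
def relative_gain(baseline_acc: float, sft_acc: float, skd_acc: float) -> float:
    """Baseline's improvement over SFT as a fraction of SKD's improvement over SFT.

    Assumed reading of the Figure 3 caption: 1.0 means the baseline matches SKD,
    0.0 means it matches SFT, and a negative value means it is worse than SFT.
    """
    skd_gain = skd_acc - sft_acc
    if skd_gain == 0:
        raise ValueError("SKD gain over SFT is zero; the ratio is undefined")
    return (baseline_acc - sft_acc) / skd_gain
```

For example, with made-up accuracies of 20 for SFT and 40 for SKD, a baseline at 30 gives `relative_gain(30, 20, 40) == 0.5`, and a baseline at 15 gives `-0.25`, which matches the caption's remark that a baseline outperformed by SFT yields a negative value.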