Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Wenda Xu; Rujun Han; Zifeng Wang; Long T. Le; Dhruv Madeka; Lei Li; William Yang Wang; Rishabh Agarwal; Chen-Yu Lee; Tomas Pfister

TL;DR

Speculative Knowledge Distillation (SKD) uses interleaved teacher–student sampling to address the knowledge gap between large teachers and smaller students. By checking each student-proposed token against the teacher's top-$K$ tokens and adaptively transitioning from supervised-like to on-policy-like training, SKD mitigates both low-quality samples and distribution mismatch. The approach yields consistent improvements on translation, dialogue summarization, arithmetic reasoning, and math instruction following, across different initializations and data regimes, while enabling faster speculative-decoding-based inference. This end-to-end KD framework offers robust performance, practical speedups, and broad applicability to task-specific and task-agnostic distillation of language models.

Abstract

Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD are adversely affected by the knowledge gap between teacher and student in practical scenarios. Supervised KD suffers from a distribution mismatch between training on a static dataset and inference over the student's own generated outputs. Conversely, on-policy KD, which trains on student-generated samples, can suffer from low-quality training examples that are unfamiliar to the teacher model, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on the fly while staying aligned with the student's inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.
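The propose-and-replace loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `student` and `teacher` are hypothetical callables that map a token prefix to a next-token probability list, verification is done one token at a time rather than in blocks, and a rejected token is assumed to be resampled from the teacher's full distribution. The function names `top_k_tokens`, `sample`, and `skd_generate` are invented for this sketch.

```python
import random


def top_k_tokens(probs, k):
    """Return the set of the k highest-probability token ids."""
    ranked = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
    return set(ranked[:k])


def sample(probs, rng):
    """Draw one token id from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def skd_generate(student, teacher, prompt, max_new_tokens, k, rng):
    """Interleaved sampling: the student proposes, the teacher may replace.

    `student` and `teacher` are hypothetical callables: prefix -> probs.
    Returns the new tokens and a per-token flag marking teacher replacements.
    """
    tokens = list(prompt)
    replaced = []
    for _ in range(max_new_tokens):
        proposal = sample(student(tokens), rng)
        teacher_probs = teacher(tokens)
        if proposal in top_k_tokens(teacher_probs, k):
            # Proposal ranks within the teacher's top-K: keep the student token.
            tokens.append(proposal)
            replaced.append(False)
        else:
            # Poorly ranked proposal: assume resampling from the teacher.
            tokens.append(sample(teacher_probs, rng))
            replaced.append(True)
    return tokens[len(prompt):], replaced


# Toy models over a 4-token vocabulary (assumed for illustration only).
def toy_teacher(prefix):
    return [0.7, 0.2, 0.05, 0.05]


def toy_student(prefix):
    return [0.25, 0.25, 0.25, 0.25]


new_tokens, flags = skd_generate(
    toy_student, toy_teacher, prompt=[0], max_new_tokens=8, k=2,
    rng=random.Random(0),
)
replacement_rate = sum(flags) / len(flags)
```

In this sketch, `replacement_rate` gives a rough handle on the adaptive behaviour: a student far from the teacher has most proposals replaced, so the sequence is close to teacher-generated (supervised-like) data, while a well-aligned student keeps most of its own tokens (on-policy-like data). The resulting sequence would then be scored by the teacher to form the distillation loss, which is omitted here.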
