SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs

Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren

TL;DR

This work reframes knowledge distillation for LLMs by addressing the inefficiency of uniform token-wise supervision, especially with large teacher–student capacity gaps. It introduces SelecTKD, a plug-and-play selective token-weighted distillation framework that uses a propose-and-verify mechanism (Top-$k$ greedy and Spec-$k$ non-greedy) to determine which tokens to learn from, with accepted tokens receiving full loss and rejected ones down-weighted by $\beta$. The framework induces an implicit curriculum via Token Acceptance Rate (TAR) and is objective-agnostic, improving performance across instruction following, math, code generation, and vision-language tasks, including state-of-the-art results for small models and faster speculative decoding. Theoretical analysis links TAR to monotonic improvement, smoother loss landscapes, and better generalization, while extensive experiments demonstrate robust gains across diverse baselines and data regimes, highlighting SelecTKD’s practical impact for deploying compact, capable models.

Abstract

Knowledge distillation (KD) is a standard route to compress Large Language Models (LLMs) into compact students, yet most pipelines uniformly apply token-wise loss regardless of teacher confidence. This indiscriminate supervision amplifies noisy, high-entropy signals and is especially harmful under large teacher-student capacity gaps. We introduce SelecTKD, a plug-and-play Selective Token-Weighted distillation framework that shifts the focus from "how to measure divergence" to "where to apply learning". At each step, the student proposes tokens that are verified by the teacher through a robust propose-and-verify procedure with two variants: greedy Top-k and non-greedy Spec-k. Accepted tokens receive full loss, while rejected tokens are masked or down-weighted. This objective-agnostic design works with on- and off-policy data, induces an implicit curriculum quantified by Token Acceptance Rate (TAR), and stabilizes optimization. Across instruction following, mathematical reasoning, code generation, and a VLM setting, SelecTKD consistently improves strong baselines and achieves state-of-the-art results for small models without architectural changes or extra reference models.
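The greedy Top-k variant described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of forward KL as the per-token objective, the toy logits, and the averaging of the weighted loss over positions are all assumptions made for illustration. The student's argmax token is treated as the proposal, the teacher's Top-k set as the verifier, and rejected positions are scaled by a weight `beta` (with `beta = 0` corresponding to masking).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def top_k_ids(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def selective_topk_loss(teacher_logits, student_logits, k=2, beta=0.0):
    """Greedy Top-k sketch: returns (weighted loss, token acceptance rate).

    Assumption: the loss is the mean over positions of w_t * KL(p_t || q_t),
    with w_t = 1 for accepted tokens and w_t = beta for rejected ones.
    """
    weighted, accepted_flags = [], []
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        p = softmax(t_logits)                               # teacher distribution
        q = softmax(s_logits)                               # student distribution
        proposal = max(range(len(q)), key=lambda i: q[i])   # student's greedy token
        accepted = proposal in top_k_ids(p, k)              # teacher verification
        accepted_flags.append(accepted)
        weighted.append((1.0 if accepted else beta) * forward_kl(p, q))
    tar = sum(accepted_flags) / len(accepted_flags)
    return sum(weighted) / len(weighted), tar

# Toy example: 3 positions, vocabulary of 4 tokens (made-up numbers).
teacher_logits = [[4.0, 1.0, 0.0, -1.0], [0.0, 3.0, 2.5, -2.0], [1.0, 0.5, 0.0, 3.0]]
student_logits = [[3.0, 0.5, 0.0, 0.0], [0.0, 1.0, 2.0, 0.0], [0.0, 2.5, 0.0, 0.5]]
loss, tar = selective_topk_loss(teacher_logits, student_logits, k=2, beta=0.0)
print(f"loss={loss:.4f}  TAR={tar:.2f}")
```

In this toy input the student's greedy token falls inside the teacher's Top-2 set at the first two positions and outside it at the third, so two of three tokens are accepted. Any other per-token divergence could be substituted for `forward_kl` without changing the selection logic, which is the sense in which the design is objective-agnostic.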

Paper Structure

This paper contains 36 sections, 1 theorem, 44 equations, 5 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Assume a sufficiently small learning rate $\eta > 0$ and that the language model satisfies standard Lipschitz continuity conditions cisse2017parseval. Then, for both SelecTKD variants (Greedy Top-$k$ and Non-Greedy Spec-$k$), each gradient update step improves the Token Acceptance Rate (TAR), with t where $\kappa > 0$ is a positive constant that reflects the average ease of correcting a rejected p

Figures (5)

  • Figure 1: Performance comparison of different loss functions and training datasets on DistiLLM-2 Ko2025 and DistiLLM-2 with SelecTKD, using Qwen2-7B-Inst as the teacher and Qwen2-1.5B as the student. The chart shows Win Rate on the Evol-Instruct benchmark evaluated with GPT-4o. The baseline is gpt-3.5-turbo. This comprehensive comparison highlights the impact of various loss/data combinations and demonstrates the consistent advantage of our SelecTKD method.
  • Figure 2: Overview of SelecTKD, a "propose-and-verify" framework for selective token-level knowledge distillation. Left: Dataset Generation. Both teacher $p(\boldsymbol{x},\boldsymbol{y})$ and student $q_\theta(\boldsymbol{x},\boldsymbol{y})$ can produce responses, enabling on- and off-policy data. Middle: Model Prediction and Token Sampling. Two variants are supported: (a) Greedy Top-$k$: the student proposes a greedy token $\hat{y}_t$ (argmax) and the teacher returns its Top-$k$ set; (b) Non-greedy Spec-$k$: the student samples $k$ candidate tokens and the teacher is queried only on these candidates to compute acceptance indices following speculative sampling leviathan2023fast, chen2023accelerating. Right: Verification and Loss Computation. A token is accepted if it is in the teacher's Top-$k$ (greedy) or passes the speculative acceptance test (non-greedy). Accepted tokens receive full loss; rejected tokens are masked or down-weighted by $\beta$. The loss is objective-agnostic and works with KL, RKL, SKL, and SRKL. This design focuses learning on reliable, teacher-aligned tokens and yields a stable, curriculum-like training dynamic.
  • Figure 3: Performance analysis of SelecTKD in Qwen2-1.5B: (a) Comparison of different distillation methods during training; (b) Robustness of SelecTKD to increasingly larger teachers.
  • Figure 4: Analysis of SelecTKD dynamics: (a) validation loss generally decreases as the Token Acceptance Rate (TAR) increases, indicating an implicit curriculum; (b) SelecTKD yields a flatter loss landscape li2018visualizing than strong baselines (e.g., DistiLLM-2), which correlates with improved generalization.
  • Figure 5: Performance comparison of symmetric and asymmetric loss function combinations on Qwen2-7B-Inst $\rightarrow$ Qwen2-1.5B-SFT across four different evaluation platforms. The chart shows the Win Rate of four loss function combinations: two symmetric forms (SKL+SKL, SRKL+SRKL) and two asymmetric forms (SKL+SRKL, SRKL+SKL). Results are evaluated using DeepSeek (deepseek-chat), Qwen (qwen-plus-2025-01-25), OpenAI (gpt-4o), and Kimi (moonshot-v1-8k) as judges. The consistent performance patterns across platforms indicate that loss function geometry is not the dominant factor in determining final student performance, supporting our reframing of the distillation problem from "how to measure divergence" to "where to apply supervision."
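
The non-greedy Spec-$k$ step in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's algorithm: it applies the standard speculative-sampling test (accept a token drawn from the student with probability min(1, p/q)) independently to each of the $k$ sampled candidates. The function name `spec_k_accept` and the independent per-candidate treatment are assumptions, and how the per-candidate outcomes are combined into a single per-position weight is not stated in this summary, so the sketch stops at the per-candidate flags.

```python
import random

def spec_k_accept(p, q, k, rng):
    """Sample k student candidates and test each against the teacher.

    p, q: teacher and student next-token distributions (lists of probabilities).
    Returns a list of (token_id, accepted) pairs. Only the teacher
    probabilities of the sampled candidates are looked up.
    """
    vocab = range(len(q))
    candidates = rng.choices(vocab, weights=q, k=k)    # student proposes k samples
    results = []
    for tok in candidates:
        accept_prob = min(1.0, p[tok] / q[tok])        # speculative acceptance test
        results.append((tok, rng.random() < accept_prob))
    return results

rng = random.Random(0)
teacher = [0.70, 0.20, 0.05, 0.05]   # made-up probabilities
student = [0.40, 0.40, 0.10, 0.10]
for tok, ok in spec_k_accept(teacher, student, k=4, rng=rng):
    print(tok, "accepted" if ok else "rejected")
```

Tokens to which the teacher assigns at least as much probability as the student are always accepted under this test, while tokens the student over-weights relative to the teacher are accepted only part of the time.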

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • proof: Convergence of Forward KL
  • proof: Convergence of Reverse KL
  • proof: Convergence of Skew KL (SKL)
  • proof: Convergence of Skew Reverse KL (SRKL)
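
For reference, the four objectives named in the proofs above can be written out as below. The forward and reverse KL forms are standard. The skew forms follow the convention used in the DistiLLM line of work, and it is an assumption that the paper adopts the same definitions. Here $p$ is the teacher, $q_\theta$ the student, and $\alpha \in [0, 1)$ a skew coefficient whose value is not given in this summary.

```latex
% Assumed notation: p = teacher, q_\theta = student, \alpha = skew coefficient.
\begin{align*}
\mathrm{KL}(p \,\|\, q_\theta) &= \sum_{y} p(y)\,\log\frac{p(y)}{q_\theta(y)} \\
\mathrm{RKL}(p \,\|\, q_\theta) &= \mathrm{KL}(q_\theta \,\|\, p) \\
\mathrm{SKL}_\alpha(p \,\|\, q_\theta) &= \mathrm{KL}\bigl(p \,\|\, \alpha p + (1-\alpha)\, q_\theta\bigr) \\
\mathrm{SRKL}_\alpha(p \,\|\, q_\theta) &= \mathrm{KL}\bigl(q_\theta \,\|\, (1-\alpha)\, p + \alpha\, q_\theta\bigr)
\end{align*}
```

Setting $\alpha = 0$ recovers forward KL from SKL and reverse KL from SRKL, which is why the two skew objectives are often described as smoothed versions of the plain divergences.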