SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs
Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
TL;DR
This work reframes knowledge distillation for LLMs around the inefficiency of uniform token-wise supervision, which is most harmful under large teacher–student capacity gaps. It introduces SelecTKD, a plug-and-play selective token-weighted distillation framework in which a propose-and-verify mechanism (greedy Top-$k$ or non-greedy Spec-$k$) decides which tokens to learn from: accepted tokens receive full loss, and rejected ones are down-weighted by $eta$. The framework is objective-agnostic and induces an implicit curriculum quantified by Token Acceptance Rate (TAR). It improves performance across instruction following, math, code generation, and vision-language tasks, with state-of-the-art results for small models and faster speculative decoding. Theoretical analysis links TAR to monotonic improvement, smoother loss landscapes, and better generalization, and extensive experiments show robust gains across diverse baselines and data regimes.
Abstract
Knowledge distillation (KD) is a standard route to compress Large Language Models (LLMs) into compact students, yet most pipelines uniformly apply token-wise loss regardless of teacher confidence. This indiscriminate supervision amplifies noisy, high-entropy signals and is especially harmful under large teacher-student capacity gaps. We introduce SelecTKD, a plug-and-play Selective Token-Weighted distillation framework that shifts the focus from "how to measure divergence" to "where to apply learning". At each step, the student proposes tokens that are verified by the teacher through a robust propose-and-verify procedure with two variants: greedy Top-k and non-greedy Spec-k. Accepted tokens receive full loss, while rejected tokens are masked or down-weighted. This objective-agnostic design works with on- and off-policy data, induces an implicit curriculum quantified by Token Acceptance Rate (TAR), and stabilizes optimization. Across instruction following, mathematical reasoning, code generation, and a VLM setting, SelecTKD consistently improves strong baselines and achieves state-of-the-art results for small models without architectural changes or extra reference models.
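To make the propose-and-verify idea concrete, the following is a minimal sketch of the greedy Top-k variant, not the authors' implementation. It rests on several assumptions that the abstract does not state: the student's proposal at each position is its argmax token; a proposal is accepted when it falls within the teacher's k highest-scoring tokens; the per-token objective is forward KL (any divergence could be substituted, since the design is described as objective-agnostic); and the weighted loss is averaged over all positions. The names `selective_kd_loss`, `top_k_accept`, and `down_weight` are hypothetical, and the Spec-k variant is not shown.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) for one token position."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))


def top_k_accept(teacher_logits, student_logits, k):
    """Assumed verification rule: the student's greedy token must be
    among the teacher's k highest-scoring tokens."""
    vocab = range(len(student_logits))
    proposal = max(vocab, key=student_logits.__getitem__)
    teacher_top_k = sorted(vocab, key=teacher_logits.__getitem__, reverse=True)[:k]
    return proposal in teacher_top_k


def selective_kd_loss(teacher_seq, student_seq, k=2, down_weight=0.1):
    """Token-weighted distillation loss over one sequence.

    Accepted positions get weight 1.0; rejected positions get
    `down_weight` (0.0 masks them entirely). Returns the mean weighted
    loss, the Token Acceptance Rate, and the per-position mask.
    """
    pairs = list(zip(teacher_seq, student_seq))
    mask = [top_k_accept(t, s, k) for t, s in pairs]
    losses = [forward_kl(t, s) for t, s in pairs]
    weights = [1.0 if accepted else down_weight for accepted in mask]
    loss = sum(w * l for w, l in zip(weights, losses)) / len(losses)
    tar = sum(mask) / len(mask)
    return loss, tar, mask


# Toy example: two positions, vocabulary of four tokens.
teacher_seq = [[2.0, 1.0, 0.0, -1.0], [0.0, 3.0, 1.0, -2.0]]
student_seq = [[1.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.0, 4.0]]
loss, tar, mask = selective_kd_loss(teacher_seq, student_seq, k=2, down_weight=0.1)
```

In this toy input the student's greedy token at the first position is also the teacher's top choice, so that position is accepted; at the second position the student prefers a token outside the teacher's top two, so its loss is scaled by `down_weight`. Setting `down_weight=1.0` recovers uniform token-wise distillation, which gives a simple baseline for comparison.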
