AdaSwitch: Adaptive Switching Generation for Knowledge Distillation
Jingyu Peng, Maolin Wang, Hengyi Cai, Yuchen Li, Kai Zhang, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao
TL;DR
AdaSwitch introduces a token-level, two-stage KD framework that dynamically balances exploration and teacher guidance by switching from the student to the teacher based on an adaptive, divergence-based threshold. By leveraging a moving-average divergence $\bar{d}_i$ and threshold $\tau = K \cdot \bar{d}_{i-1}$, AdaSwitch preserves training–inference consistency while maintaining supervision quality, addressing the shortcomings of purely on-policy or off-policy methods. Across three tasks and two LLM pairs, AdaSwitch delivers consistent accuracy gains with modest overhead (approximately $1.2$–$1.4\times$ on KD) and robustness to different distance metrics, outperforming existing mixed KD strategies. This approach offers a practical, scalable path to distilling small language models without incurring additional inference-time costs.
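The switching rule can be illustrated with a minimal Python sketch. It takes a list of per-token student-teacher divergences and returns the first position $i$ at which the divergence exceeds $\tau = K \cdot \bar{d}_{i-1}$, which is where generation would hand over from the student to the teacher. The name `find_switch_index`, the use of a cumulative mean for the moving average, and the choice never to switch at the first token are assumptions made for illustration, not details taken from the paper.

```python
from typing import Optional, Sequence


def find_switch_index(divergences: Sequence[float], k: float) -> Optional[int]:
    """Return the first token index i where d_i > k * mean(d_0..d_{i-1}).

    Assumptions (not from the paper): the moving average is a plain
    cumulative mean, and the first token never triggers a switch because
    no previous average exists yet. Returns None if no switch occurs.
    """
    running_sum = 0.0
    for i, d in enumerate(divergences):
        if i > 0:
            mean_prev = running_sum / i  # average divergence of tokens 0..i-1
            threshold = k * mean_prev    # tau = K * d_bar_{i-1}
            if d > threshold:
                return i                 # hand over to the teacher from here
        running_sum += d
    return None


if __name__ == "__main__":
    # Hypothetical divergence values, chosen only to show the behaviour.
    per_token_divergence = [0.1, 0.1, 0.1, 0.5, 0.1]
    print(find_switch_index(per_token_divergence, k=2.0))
```

In this sketch, a larger `k` delays the switch and leaves more of the sequence to the student, while a smaller `k` hands over to the teacher sooner.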
Abstract
Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
