AdaSwitch: Adaptive Switching Generation for Knowledge Distillation

Jingyu Peng, Maolin Wang, Hengyi Cai, Yuchen Li, Kai Zhang, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao

TL;DR

AdaSwitch introduces a token-level, two-stage KD framework that dynamically balances exploration and teacher guidance by switching from the student to the teacher based on an adaptive, divergence-based threshold. By leveraging a moving-average divergence $\bar{d}_i$ and threshold $\tau = K \cdot \bar{d}_{i-1}$, AdaSwitch preserves training–inference consistency while maintaining supervision quality, addressing the shortcomings of purely on-policy or off-policy methods. Across three tasks and two LLM pairs, AdaSwitch delivers consistent accuracy gains with modest overhead (approximately $1.2$–$1.4\times$ on KD) and robustness to different distance metrics, outperforming existing mixed KD strategies. This approach offers a practical, scalable path to distilling small language models without incurring additional inference-time costs.
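
The switching rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`kl_divergence`, `find_switch_index`), the choice of KL divergence with the teacher as the reference distribution, and the use of a plain running mean over all earlier positions as the moving average $\bar{d}_{i-1}$ are all assumptions made for illustration. The sketch also takes precomputed per-position distributions, whereas in actual generation each distribution would depend on the tokens produced so far.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def find_switch_index(teacher_dists, student_dists, K):
    """Return the first position i where d_i > K * mean(d_0, ..., d_{i-1}).

    Returns None if the threshold is never exceeded, i.e. the student
    would generate the whole sequence on its own.
    Assumption: the moving average is a running mean of all earlier
    divergences; position 0 has no history and therefore never triggers.
    """
    running_sum = 0.0
    for i, (t, s) in enumerate(zip(teacher_dists, student_dists)):
        d_i = kl_divergence(t, s)
        if i > 0:
            tau = K * (running_sum / i)  # tau = K * d_bar_{i-1}
            if d_i > tau:
                return i  # the teacher would take over from this token on
        running_sum += d_i
    return None


# Toy next-token distributions over a two-word vocabulary (made-up numbers).
teacher = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
student = [[0.8, 0.2], [0.7, 0.3], [0.9, 0.1]]
print(find_switch_index(teacher, student, K=3.0))
```

In this toy input the student tracks the teacher closely for the first two positions and then diverges sharply at the third, so the threshold is crossed there. Because the threshold scales with the running divergence rather than being fixed, a student that is uniformly far from the teacher early in training is not interrupted at every token.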

Abstract

Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.

Paper Structure

This paper contains 22 sections, 6 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: An example illustrating the sequence generation process for off-policy, on-policy, and SKD methods. denotes the teacher, and represents the student.
  • Figure 2: Overview of the proposed AdaSwitch approach. When the divergence between the student and teacher logits for the next token exceeds $\tau = K \cdot \bar{d}_i$, AdaSwitch switches to the teacher to generate the remaining tokens. denotes the teacher, and represents the student.
  • Figure 3: Further comparison of performance under different distance metrics on three tasks.
  • Figure 4: Analysis of the switch rate and the KLD at the switch token between the student and teacher models throughout the distillation process. Statistics were collected every 100 steps for the first 1000 steps.
  • Figure 5: Performance on the validation and test sets during the early stages of distillation.
  • ...and 4 more figures