TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

Makoto Shing, Kou Misaki, Han Bao, Sho Yokoi, Takuya Akiba

TL;DR

TAID addresses the core challenge of compressing large language and vision-language models by bridging the teacher–student gap with a temporally adaptive interpolated distillation mechanism. It introduces a time-varying intermediate distribution $p_t$ that transitions from the student's own distribution toward the teacher's, with an adaptive update of the interpolation parameter to optimize learning. Theoretical analysis shows TAID avoids mode collapse under reasonable signal strength and step budgets, while empirical results demonstrate superior performance across instruction tuning and pre-training, including state-of-the-art compact models TAID-LLM-1.5B and TAID-VLM-2B. The approach yields robust knowledge transfer, improved training stability, and practical deployment benefits in resource-constrained settings.

Abstract

Causal language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation, a widely-used technique for transferring knowledge from a large teacher model to a small student model, presents a promising approach for model compression. A significant remaining issue lies in the major differences between teacher and student models, namely the substantial capacity gap, mode averaging, and mode collapse, which pose barriers during distillation. To address these issues, we introduce $\textit{Temporally Adaptive Interpolated Distillation (TAID)}$, a novel knowledge distillation approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution. We provide a theoretical analysis demonstrating TAID's ability to prevent mode collapse and empirically show its effectiveness in addressing the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID's practical impact by developing two state-of-the-art compact foundation models: $\texttt{TAID-LLM-1.5B}$ for language tasks and $\texttt{TAID-VLM-2B}$ for vision-language tasks. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
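The interpolation idea in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: mixing the two models in logit space, the direction of the KL divergence, and the names `interpolate` and `kl_divergence` are all assumptions made for illustration. The abstract only states that the intermediate distribution shifts gradually from the student's distribution towards the teacher's.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpolate(student_logits, teacher_logits, t):
    """Intermediate distribution for t in [0, 1].

    Assumption: the mix is done in logit space, so t=0 recovers the
    student distribution and t=1 recovers the teacher distribution.
    """
    mixed = [(1.0 - t) * s + t * q
             for s, q in zip(student_logits, teacher_logits)]
    return softmax(mixed)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

# Toy next-token logits over a 3-token vocabulary (made-up numbers).
student_logits = [2.0, 0.5, -1.0]
teacher_logits = [0.1, 3.0, 0.2]
student_probs = softmax(student_logits)
teacher_probs = softmax(teacher_logits)

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    p_t = interpolate(student_logits, teacher_logits, t)
    gap = kl_divergence(p_t, student_probs)
    print(f"t={t:.2f}  p_t={[round(x, 3) for x in p_t]}  "
          f"KL(p_t || student)={gap:.4f}")
```

In this toy setting the target at small `t` stays close to what the student already predicts, and the distance to the student grows as `t` approaches 1, which is one way to read the "gradual shift" described above.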

Paper Structure

This paper contains 48 sections, 3 theorems, 26 equations, 3 figures, 10 tables, 2 algorithms.

Key Result

Theorem 4.1

Suppose we run distillation for $T$ steps in total. If the teacher model has sufficiently large signals so that the label is at least as large as $\Omega(\sqrt{T\epsilon})$, then the student model does not collapse for any time $t$.

Figures (3)

  • Figure 1: Comparison of standard KD and TAID. (Left) Standard KD methods typically employ direct optimization towards a fixed teacher distribution. (Right) TAID creates a dynamic bridge through adaptive, time-dependent intermediate teacher distributions (green dashed lines), enabling gradual optimization of the student. This approach facilitates a flexible transition from the student's initial distribution towards the teacher's distribution over time, effectively addressing the capacity gap and balancing knowledge transfer across varying model sizes.
  • Figure 2: Analysis of TAID's behavior and performance. (Left) Interpolation parameter $t$ behavior: Higher $\alpha$ values lead to faster initial growth compared to linear increase, allowing for more aggressive knowledge transfer in early stages when the capacity gap is small. (Middle) Objective value comparison: TAID exhibits a more stable objective value with lower variance compared to standard KL divergence throughout training, indicating a consistent learning difficulty that aligns with the student's evolving capabilities. (Right) Performance across different teacher sizes: TAID shows monotonic improvement and outperforms other methods as teacher size increases, demonstrating its effectiveness in addressing the curse of capacity gap.
  • Figure 3: Comparison between image classification and language modeling tasks. Language modeling (GPT-2) exhibits significantly higher entropy and lower target-class probabilities compared to image classification (ResNet-56). These fundamental differences highlight the unique challenges in language model distillation.
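The left panel of Figure 2 describes how the interpolation parameter $t$ grows faster than a linear schedule for larger $\alpha$. The sketch below shows one way such a schedule could behave. The update rule is an assumption for illustration only (this section does not state it): a momentum term tracks the relative drop in the objective, $t$ is pushed towards 1 by a step proportional to `alpha`, and a linear ramp acts as a floor. The names `adaptive_schedule`, `beta`, `t_start`, and `t_end`, and the synthetic objective values, are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def adaptive_schedule(objectives, alpha, beta=0.99,
                      t_start=0.4, t_end=1.0, eps=1e-15):
    """Return the value of t after each training step.

    Assumed rule (illustrative, not taken from the source):
      rel_change = (prev_obj - obj) / prev_obj
      momentum   = beta * momentum + (1 - beta) * rel_change
      t          = max(t + alpha * sigmoid(momentum) * (1 - t), linear ramp)
    """
    total = len(objectives)
    t, momentum, prev = t_start, 0.0, None
    history = []
    for step, obj in enumerate(objectives, start=1):
        if prev is not None:
            rel_change = (prev - obj) / (prev + eps)
            momentum = beta * momentum + (1.0 - beta) * rel_change
            delta = alpha * sigmoid(momentum) * (1.0 - t)
            linear_t = t_start + (t_end - t_start) * step / total
            # Never fall below the linear ramp, never exceed t_end.
            t = min(t_end, max(t + delta, linear_t))
        prev = obj
        history.append(t)
    return history

# Synthetic, slowly decreasing objective values (made-up numbers).
objectives = [1.0 / (1.0 + 0.01 * k) for k in range(200)]

linear_only = adaptive_schedule(objectives, alpha=0.0)   # reduces to the ramp
adaptive = adaptive_schedule(objectives, alpha=0.05)     # larger alpha

for step in (1, 10, 50, 100, 200):
    print(f"step {step:3d}: linear t={linear_only[step - 1]:.3f}  "
          f"adaptive t={adaptive[step - 1]:.3f}")
```

Under these assumptions, `alpha=0.0` falls back to the linear ramp, while a larger `alpha` moves `t` ahead of the ramp early in training, matching the qualitative behavior the caption describes.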

Theorems & Definitions (7)

  • Definition 3.1: TAID Interpolated Distribution
  • Definition 3.2: TAID Objective
  • Theorem 4.1: Non-collapse Nature (Informally)
  • Theorem B.1
  • Corollary B.1.1
  • Proof: Proof of Theorem \ref{theorem:non_collapse}
  • Proof: Proof of Corollary \ref{corollary:taid_non_collapse}