DistiLLM: Towards Streamlined Distillation for Large Language Models

Jongwoo Ko; Sungnyun Kim; Tianyi Chen; Se-Young Yun

TL;DR

DistiLLM addresses two inefficiencies in distilling autoregressive language models: the lack of a standard objective function and the cost of training on student-generated outputs (SGOs). It introduces a theoretically grounded skew KL divergence (SKL) objective, together with its reverse counterpart (SRKL), and an adaptive off-policy framework for using SGOs. The SKL/SRKL losses provide stable gradients and bounded approximation error, while the adaptive SGO scheduler and replay buffer improve sample efficiency and reduce training time. Empirical results across instruction-following, summarization, and translation show state-of-the-art performance for smaller student models with substantial speedups over prior KD methods. The work offers a practical, generalizable distillation recipe that scales to larger LLMs and reduces computational cost without sacrificing effectiveness.
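To make the roles of the SGO scheduler and the replay buffer concrete, here is a minimal sketch of one plausible reading of that idea. It is not the paper's algorithm: the class `SGOReplayBuffer`, the functions `update_sgo_probability` and `next_training_example`, the parameters `phi`, `reuse_prob`, and `step`, and the rule "raise the SGO probability when validation loss gets worse" are all assumptions introduced for illustration.

```python
import random
from collections import deque


class SGOReplayBuffer:
    """Fixed-capacity store of (prompt, student response) pairs (hypothetical)."""

    def __init__(self, capacity, seed=0):
        self.items = deque(maxlen=capacity)  # oldest entries are evicted first
        self.rng = random.Random(seed)

    def add(self, prompt, response):
        self.items.append((prompt, response))

    def sample(self):
        return self.rng.choice(list(self.items))

    def __len__(self):
        return len(self.items)


def update_sgo_probability(phi, prev_val_loss, val_loss, step=0.1):
    """Assumed rule: use SGOs more often once validation loss stops improving."""
    if val_loss > prev_val_loss:
        return min(1.0, phi + step)
    return phi


def next_training_example(dataset_example, buffer, phi, reuse_prob, generate, rng):
    """Pick the training target for one step.

    With probability 1 - phi the fixed dataset reference is used. Otherwise an
    SGO is used: either replayed from the buffer (off-policy, no generation
    cost) or freshly generated by the student and stored for later reuse.
    """
    prompt, reference = dataset_example
    if rng.random() >= phi:
        return prompt, reference, "dataset"
    if len(buffer) > 0 and rng.random() < reuse_prob:
        old_prompt, old_response = buffer.sample()
        return old_prompt, old_response, "replayed"
    response = generate(prompt)  # the expensive student decoding step
    buffer.add(prompt, response)
    return prompt, response, "fresh"


if __name__ == "__main__":
    rng = random.Random(0)
    buf = SGOReplayBuffer(capacity=4)
    fake_student = lambda prompt: prompt.upper()  # stand-in for student decoding
    for _ in range(5):
        print(next_training_example(("hi", "hello"), buf, 0.8, 0.5, fake_student, rng))
```

In this sketch the saving comes from the "replayed" branch, which skips student decoding entirely; how aggressively to replay is a trade-off between cost and how stale the stored outputs are.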

Abstract

Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller student model, reducing its inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) lack a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatches has significantly increased computational costs. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to improve the efficiency of using student-generated outputs. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3$\times$ speedup compared to recent KD methods.
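As a rough illustration of the skew Kullback-Leibler loss, the sketch below computes it for small discrete distributions. The abstract does not state the formula, so the definitions used here are assumptions based on the usual skew divergence: SKL as $D_{\text{KL}}(p \,\|\, \alpha p + (1-\alpha) q)$ and SRKL as $D_{\text{KL}}(q \,\|\, \alpha q + (1-\alpha) p)$, with $p$ the teacher and $q$ the student. The function names are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total


def mix(a, b, alpha):
    """Elementwise mixture alpha * a + (1 - alpha) * b."""
    return [alpha * ai + (1.0 - alpha) * bi for ai, bi in zip(a, b)]


def skew_kl(p, q, alpha):
    """Assumed alpha-SKL: KL(p || alpha * p + (1 - alpha) * q)."""
    return kl(p, mix(p, q, alpha))


def skew_reverse_kl(p, q, alpha):
    """Assumed alpha-SRKL: KL(q || alpha * q + (1 - alpha) * p)."""
    return kl(q, mix(q, p, alpha))


if __name__ == "__main__":
    teacher = [0.7, 0.3, 0.0]
    student = [0.0, 0.5, 0.5]
    print("KL  :", kl(teacher, student))          # infinite: supports do not match
    print("SKL :", skew_kl(teacher, student, 0.1))
    print("SRKL:", skew_reverse_kl(teacher, student, 0.1))
```

Under this assumed definition, mixing a fraction $\alpha$ of the first argument into the second keeps the divergence at or below $\log(1/\alpha)$ even when the two supports differ, which is one intuition for why a skewed objective can behave more stably than plain KL.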

Paper Structure (54 sections, 6 theorems, 25 equations, 16 figures, 15 tables, 1 algorithm)

Introduction
Contributions.
Background
KD for Auto-regressive Generative LMs
Pitfalls of Existing Distillation
Limitation of objective functions.
Limitations of utilizing SGO.
DistiLLM
Skew (Reverse) KLD
Stable gradient.
Small approximation error.
Adaptive Off-policy Approach
Adaptive SGO scheduler.
Off-policy approach for sample efficiency.
Synergy with SKL.
...and 39 more sections

Key Result

Theorem 1

Let $p^{1}_{n}$ and $p^{2}_{n}$ be empirical distributions of $n$ i.i.d. samples from $p^{1}$ and $p^{2}$, respectively. Under mild assumptions, we have an upper bound for the L2 norm of the $\alpha$-SKL estimator $D_{\text{SKL}}^{(\alpha)}(p^{1}_n, p^{2}_n)$ for $D_{\text{SKL}}^{(\alpha)}(p^{1}, p^{2})$, for $c_{1}(\alpha)=\min\left\{\frac{1}{\alpha^{2}}, \frac{\chi^{2}(p^{1}, p^{2})^{2}}{(1-\alpha)^{2}}\right\}$ ...
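The setting of the theorem can be explored numerically with a small simulation: draw $n$ i.i.d. samples from two discrete distributions, form the empirical distributions, and compare the plug-in $\alpha$-SKL estimate with the true value. This is only a sketch. It assumes $D_{\text{SKL}}^{(\alpha)}(p^{1}, p^{2}) = D_{\text{KL}}(p^{1} \,\|\, \alpha p^{1} + (1-\alpha) p^{2})$ with $\alpha \in (0, 1]$, the helper names and example distributions are made up, and it does not reproduce the bound itself, whose full statement is not available here.

```python
import math
import random
from collections import Counter


def skew_kl(p, q, alpha):
    """Assumed alpha-SKL, KL(p || alpha * p + (1 - alpha) * q), for alpha in (0, 1]."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            total += pi * math.log(pi / (alpha * pi + (1.0 - alpha) * qi))
    return total


def empirical(samples, support_size):
    """Empirical distribution of integer samples over {0, ..., support_size - 1}."""
    counts = Counter(samples)
    return [counts[k] / len(samples) for k in range(support_size)]


def estimation_error(p1, p2, alpha, n, rng):
    """|plug-in estimate - true value| for one draw of n samples from each distribution."""
    support = range(len(p1))
    s1 = rng.choices(support, weights=p1, k=n)
    s2 = rng.choices(support, weights=p2, k=n)
    estimate = skew_kl(empirical(s1, len(p1)), empirical(s2, len(p2)), alpha)
    return abs(estimate - skew_kl(p1, p2, alpha))


if __name__ == "__main__":
    rng = random.Random(0)
    p1 = [0.6, 0.3, 0.1]
    p2 = [0.2, 0.3, 0.5]
    for n in (100, 1000, 10000):
        # Average squared error over repeated draws, for a fixed skew value.
        sq = [estimation_error(p1, p2, 0.1, n, rng) ** 2 for _ in range(50)]
        print(n, sum(sq) / len(sq))
```

Because the mixture in the second argument always contains a share $\alpha$ of the first, the plug-in estimate stays finite for any sample, which makes it convenient to study how the error changes with $n$ and $\alpha$.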

Figures (16)

Figure 1: Examples of SGOs from GPT-2 (student) and their corresponding validation loss by GPT-2 XL (teacher). Since the teacher model may not be familiar with the SGO, using $p(\mathbf{y} | \mathbf{x})$ as a target distribution can misguide the student model, as shown in Tab. \ref{tab:genfilt}.
Figure 1: Evaluation of the effect of SKL and SRKL loss functions. Bold and underline indicate the best and second-best results, respectively, among those from the same evaluation dataset. We report the average and standard deviation of ROUGE-L scores across five random seeds. Green ($\bullet$) and red ($\bullet$) arrows indicate improvement and deterioration over the corresponding baselines.
Figure 2: (Left): Normalized runtime according to the maximum response length of SGOs with GPT-2 XL teacher and GPT-2 student. (Right): Normalized runtime for various sizes of teacher and student models with a response length of 256. FWD and BWD denote forward and backward propagation, respectively.
Figure 2: Evaluation of the adaptive off-policy approach. We report the average and standard deviation of ROUGE-L scores across five random seeds. The best and second-best results are highlighted in bold and underlined, respectively. Green ($\bullet$) and red ($\bullet$) arrows indicate improvement and deterioration over the baselines.
Figure 3: (a)-(b): Gradient coefficient distribution for SKL and SRKL across different skew values $\alpha$, as shown in Eq. \ref{eq:grad_skl}--\ref{eq:grad_srkl}. (c): Distribution of differences between divergence values and their (exponential) moving average of $\alpha$-S(R)KL, as shown in Thm. \ref{method:thm}, and those of $\beta$-JSD by substituting SKL into JSD across different $\alpha$ and $\beta$, respectively. (d): Normalized L2 norm distribution, dividing the L2 norm in (c) by corresponding gradient coefficient values.
...and 11 more figures

Theorems & Definitions (10)

Theorem 1
Remark 1
Lemma 2.1: liu2021divergence
Lemma 2.2: liu2021divergence
Lemma 2.3: lee2022renyicl
proof
Lemma 2.4: rubenstein2019practical
Lemma 2.5: lee2022renyicl
proof
proof
