Distillation of Large Language Models via Concrete Score Matching

Yeongmin Kim, Donghyeok Shin, Mina Kang, Byeonghu Na, Il-Chul Moon

TL;DR

This work addresses two limitations of standard knowledge distillation (KD) for large autoregressive language models: softmax-induced smoothing of teacher knowledge and the restricted solution set of direct logit matching. It introduces Concrete Score Distillation (CSD), a discrete score-matching objective that aligns pairwise logit differences (equivalently, log probability ratios) between student and teacher across vocabulary pairs, stabilized by a log transform and flexible pairwise weighting. The authors prove that CSD is consistent and that its solution set is a superset of that of direct logit distillation, and they derive an efficient $\mathcal{O}(|\mathcal{V}|)$ gradient computation. Experiments on task-agnostic and task-specific distillation with models up to 7B parameters show that CSD consistently outperforms baselines, offers tunable fidelity-diversity trade-offs, and yields complementary gains when combined with on-policy methods.

Abstract

Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient inference. Existing KD objectives typically match student and teacher probabilities via softmax, which blurs valuable logit information. While direct logit distillation (DLD) mitigates softmax smoothing, it fails to account for logit shift invariance, thereby restricting the solution space. We propose Concrete Score Distillation (CSD), a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set. We resolve the training instability and quadratic complexity of discrete score-matching in autoregressive LLMs, and the resulting CSD objective aligns relative logit differences across all vocabulary pairs between student and teacher with flexible weighting. We provide both mode-seeking and mode-covering instances within our framework and evaluate CSD on task-agnostic instruction-following and task-specific distillation using GPT-2-1.5B, OpenLLaMA-7B, and GEMMA-7B-IT. Experiments show that CSD consistently surpasses recent KD objectives, achieves favorable fidelity-diversity trade-offs, and yields complementary gains when combined with on-policy techniques, demonstrating its scalability and effectiveness for LLM distillation.
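
The idea of aligning relative logit differences across all vocabulary pairs can be illustrated with a minimal sketch. It assumes a squared-error penalty on pairwise logit differences and a product-form weighting `w(i, j) = a[i] * b[j]`; both are illustrative assumptions, and the function names are hypothetical rather than taken from the paper. Under these assumptions the pairwise sum collapses to a few weighted sums, so it can be evaluated in time linear in the vocabulary size, and adding a constant to every student logit leaves the loss unchanged.

```python
from itertools import product


def pairwise_loss_naive(student, teacher, a, b):
    """O(|V|^2) reference: 0.5 * sum_{i,j} a[i]*b[j]*((s_i - s_j) - (t_i - t_j))**2."""
    n = len(student)
    total = 0.0
    for i, j in product(range(n), repeat=2):
        gap = (student[i] - student[j]) - (teacher[i] - teacher[j])
        total += a[i] * b[j] * gap * gap
    return 0.5 * total


def pairwise_loss_linear(student, teacher, a, b):
    """Same value in O(|V|): with d_i = s_i - t_i, expand (d_i - d_j)**2."""
    d = [s - t for s, t in zip(student, teacher)]
    sum_a, sum_b = sum(a), sum(b)
    a_d = sum(ai * di for ai, di in zip(a, d))
    b_d = sum(bi * di for bi, di in zip(b, d))
    a_d2 = sum(ai * di * di for ai, di in zip(a, d))
    b_d2 = sum(bi * di * di for bi, di in zip(b, d))
    return 0.5 * (a_d2 * sum_b - 2.0 * a_d * b_d + sum_a * b_d2)


teacher = [-1.0, -4.0, 4.0]
student = [1.0, -9.0, 6.0]
shifted = [t + 3.0 for t in teacher]   # teacher logits plus a constant
uniform = [1.0 / 3.0] * 3              # assumed weights (illustrative only)

loss_naive = pairwise_loss_naive(student, teacher, uniform, uniform)
loss_linear = pairwise_loss_linear(student, teacher, uniform, uniform)
loss_shifted = pairwise_loss_linear(shifted, teacher, uniform, uniform)

# Direct logit matching (plain squared error) penalizes the constant shift.
direct_shifted = sum((s - t) ** 2 for s, t in zip(shifted, teacher))

print(loss_naive, loss_linear, loss_shifted, direct_shifted)
```

In this sketch the shifted student incurs zero pairwise loss while the plain squared error on logits is nonzero, which is the sense in which matching logits directly restricts the set of optimal solutions.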

Paper Structure

This paper contains 20 sections, 6 theorems, 24 equations, 9 figures, 11 tables, 1 algorithm.

Key Result

Proposition 0

(Consistency) Given context ${\mathbf{c}}$ and prefix ${\mathbf{y}}_{<t}$, assume sufficient model capacity. For any $w(\cdot,\cdot)>0$, define the set of optimal parameters as $\Theta_{\text{CSD}}^* = \mathop{\mathrm{arg\,min}}\limits_{\theta}{\mathcal{L}_\text{CSD}\left(\theta; p_{T},w\right)}$. T

Figures (9)

  • Figure 1: Motivation for logit-level distillation and limitations of prior work. (a) Statistics of per-token probabilities for every vocabulary for 16 input–output sequences from the teacher model (GPT-2-1.5B). The probabilities are highly sparse, with only 0.0023% being greater than 0.01. (b) Despite large differences in logits (e.g., $[-1, -4, 4]$ vs. $[1, -9, 6]$), softmax yields nearly identical probabilities and gradients. (c) Prior direct logit distillation restricts the solution set.
  • Figure 2: Schematic for $\mathcal{L}_{\text{CSD}}$ (Eq. \ref{eq:csd}).
  • Figure 3: An in-depth analysis of the distributional behavior of different loss functions.
  • Figure 4: GPT-4 feedback performance, showing the proportion of responses judged correct relative to the golden answers. The teacher’s score is 0.61.
  • Figure 5: Ablation between analytic gradient calculation (Eq. \ref{eq:grad}) and Monte Carlo sampling for the $(S,S)$ weighting $\mathcal{L}_{\text{CSD}}$ calculation.
  • ...and 4 more figures
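
The softmax smoothing described in the caption of Figure 1(b) can be checked numerically. The sketch below uses the two logit vectors quoted in that caption; the helper `softmax` and the variable names are illustrative, not from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits_a = [-1.0, -4.0, 4.0]
logits_b = [1.0, -9.0, 6.0]

probs_a = softmax(logits_a)
probs_b = softmax(logits_b)

# Largest per-token gap in probability space versus logit space.
max_prob_gap = max(abs(p - q) for p, q in zip(probs_a, probs_b))
max_logit_gap = max(abs(x - y) for x, y in zip(logits_a, logits_b))

print(probs_a, probs_b)
print(max_prob_gap, max_logit_gap)
```

The logits differ by as much as 5 on a single token, yet the resulting probabilities differ by less than $10^{-3}$ everywhere, so a loss defined on probabilities sees almost no difference between the two vectors.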

Theorems & Definitions (9)

  • Proposition 0
  • Theorem 1
  • Theorem 2
  • Proposition 2
  • proof
  • Theorem 2
  • proof
  • Theorem 2
  • proof