Language Model Distillation: A Temporal Difference Imitation Learning Perspective

Zishun Yu, Shangzhe Li, Xinhua Zhang

TL;DR

This work reframes language model distillation as a temporal-difference imitation-learning problem and exploits the intrinsic sparsity of language distributions by introducing a top-$p$ action subset. By defining a top-$p$ MDP and a corresponding top-$p$ soft Bellman operator, the authors prove contraction and near-optimality guarantees, enabling efficient TD-based distillation with reduced computational burden. They implement a practical Bellman Distill (BD) method based on IQL, using top-$p$ masking, policy projection, and $\chi^2$ regularization, and demonstrate strong, scalable gains across GPT-2, OPT, and Qwen-2.5 families in offline distillation settings, with faster training and robust Rouge-L and win-rate improvements. The approach provides a principled, plug-and-play framework for TD-style IL in large discrete action spaces and highlights the value of action-space priors in distillation.

Abstract

Large language models have led to significant progress across many NLP tasks, although their massive sizes often incur substantial computational costs. Distillation has become a common practice to compress these large and highly capable models into smaller, more efficient ones. Many existing language model distillation methods can be viewed as behavior cloning from the perspective of imitation learning or inverse reinforcement learning. This viewpoint has inspired subsequent studies that leverage (inverse) reinforcement learning techniques, including variations of behavior cloning and temporal difference learning methods. Rather than proposing yet another specific temporal difference method, we introduce a general framework for temporal difference-based distillation by exploiting the distributional sparsity of the teacher model. Specifically, it is often observed that language models assign most probability mass to a small subset of tokens. Motivated by this observation, we design a temporal difference learning framework that operates on a reduced action space (a subset of vocabulary), and demonstrate how practical algorithms can be derived and the resulting performance improvements.
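
The reduced action space described above can be illustrated with a small sketch. It assumes the top-$p$ candidate set is the smallest set of tokens whose cumulative teacher probability reaches $p$ (the usual nucleus-style construction); the paper's exact definition may differ. The function names `top_p_candidates` and `restrict_and_renormalize` are hypothetical and are not taken from the authors' code.

```python
from typing import Dict, List, Sequence


def top_p_candidates(probs: Sequence[float], p: float) -> List[int]:
    """Return indices of the smallest token set whose total mass is >= p.

    Assumption: a nucleus-style construction; not necessarily the paper's
    exact definition of the top-p candidate set.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    # Visit tokens from most to least probable (ties keep original order).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen: List[int] = []
    mass = 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return chosen


def restrict_and_renormalize(
    probs: Sequence[float], candidates: Sequence[int]
) -> Dict[int, float]:
    """Keep only candidate tokens and rescale their mass to sum to one."""
    total = sum(probs[i] for i in candidates)
    return {i: probs[i] / total for i in candidates}


if __name__ == "__main__":
    # Toy next-token distribution over a five-token vocabulary (made up).
    teacher_probs = [0.05, 0.5, 0.1, 0.3, 0.05]
    subset = top_p_candidates(teacher_probs, p=0.75)
    print(subset)
    print(restrict_and_renormalize(teacher_probs, subset))
```

Any value-based update restricted to `subset` only touches a handful of actions per state instead of the full vocabulary, which is the source of the computational savings the abstract alludes to.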

Paper Structure

This paper contains 23 sections, 4 theorems, 16 equations, 4 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

$\mathcal{B}^\pi_p$ is a contraction in the supported $\infty$-norm.
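
For orientation, a contraction claim of this kind usually takes the following shape. This is a sketch rather than the paper's exact statement: the modulus $\gamma \in [0,1)$ (commonly the discount factor), the candidate-set notation $\mathcal{A}_p(s)$, and the norm subscript are assumptions introduced here, and the supported $\infty$-norm is assumed to take the supremum only over actions in the top-$p$ candidate set.

$$
\left\| \mathcal{B}^\pi_p Q_1 - \mathcal{B}^\pi_p Q_2 \right\|_{p,\infty} \le \gamma \left\| Q_1 - Q_2 \right\|_{p,\infty},
\qquad
\| Q \|_{p,\infty} := \sup_{s} \max_{a \in \mathcal{A}_p(s)} | Q(s,a) |.
$$

Under these assumptions, the Banach fixed-point theorem would give a unique fixed point among functions restricted to the supported state-action pairs, with repeated application of the operator converging to it geometrically.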

Figures (4)

  • Figure 1: Sorted token probabilities averaged across $20$ sequences generated by Qwen-2.5 3B. The top $50$ tokens account for $96\%$ of the total mass, and the top $7$ tokens alone contribute $\geq 90\%$.
  • Figure 2: Comparison of win rates against KD, SeqKD, and MiniLLM baselines. We evaluate using GPT-4o-mini [openai2024gpt4omini] as the judging oracle, with Qwen-2.5 (3B) as the teacher model and a smaller (0.5B) model as the student. Results are based on 500 responses per distilled model, generated under the Dolly evaluation setting.
  • Figure 3: Prompt wrapper for evaluation and data generation.
  • Figure 4: Prompt wrapper for win rate evaluation using GPT-4o-mini [openai2024gpt4omini].

Theorems & Definitions (15)

  • Definition 1: top-$p$ candidate set
  • Definition 2: top-$p$ MDP and $\bar{Q}$
  • Definition 3: top-$p$ projection
  • Definition 4: top-$p$ soft Bellman operator
  • Definition 5: supported $q$-norm
  • Definition 6: fixed point of $\mathcal{B}^\pi_p$
  • Proposition 1: contraction
  • Proposition 2
  • Proposition 3: sandwich condition
  • Proposition 4: bounded sub-optimality
  • ...and 5 more