A Survey of On-Policy Distillation for Large Language Models

Mingyang Song, Mao Zheng

Abstract

Knowledge distillation has become a primary mechanism for transferring reasoning and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains \textit{off-policy}: students train on static teacher-generated data and never encounter their own errors during learning. This train--test mismatch, an instance of \textit{exposure bias}, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented with no unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified $f$-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: \emph{feedback signal} (logit-based, outcome-based, or self-play), \emph{teacher access} (white-box, black-box, or teacher-free), and \emph{loss granularity} (token-level, sequence-level, or hybrid). We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
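
To make the on-policy objective concrete, the sketch below writes one common form of an $f$-divergence distillation loss taken over student-sampled trajectories. The notation is assumed here for illustration rather than drawn from the survey body: $\pi_\theta$ is the student policy, $p_T$ the teacher's next-token distribution, $\mathcal{D}$ a prompt distribution, and $D_f$ a generic $f$-divergence (forward KL, reverse KL, JSD, ...).

% Illustrative sketch only; individual OPD methods instantiate D_f and the
% sampling scheme differently.
\[
\mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}}\;
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      D_f\!\left( p_T(\cdot \mid x,\, y_{<t}) \,\middle\|\, \pi_\theta(\cdot \mid x,\, y_{<t}) \right)
    \right]
\]

Because the trajectory $y$ is drawn from the student itself, every supervised token sits on a prefix the student actually produced, which is what removes the exposure-bias mismatch of off-policy distillation; in practice the sampling step is typically treated as fixed and only the per-token divergence term is differentiated, though specific methods vary on this point.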

Figures (2)

  • Figure 1: Forward KL vs. Reverse KL divergence for fitting a student distribution $P_S$ to a bimodal teacher $P_T$. (a) The teacher distribution with two modes. (b) Forward KL is mode-covering (zero-avoiding): the student covers both modes but places mass in the inter-mode "hallucination zone." (c) Reverse KL is mode-seeking (zero-forcing): the student concentrates on a single peak, dropping the other mode entirely. Adaptive methods (ToDi, Entropy-Aware OPD) dynamically switch between the two based on teacher confidence. The two divergences are written out in the sketch after this list.
  • Figure 2: Taxonomy of on-policy distillation for large language models. The methodology space is organized along three orthogonal dimensions with representative methods. A single method may appear across multiple categories as these axes are independent.
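
For reference, the two divergences contrasted in Figure 1 can be written out as follows, using the caption's notation of a teacher distribution $P_T$ and a student distribution $P_S$; the behavioral readings are the standard ones rather than anything specific to the surveyed methods.

% Forward KL: the expectation is taken under the teacher, so any region where
% P_T > 0 but P_S -> 0 makes the loss blow up; the student is pushed to cover
% every teacher mode (zero-avoiding, mode-covering).
\[
D_{\mathrm{KL}}(P_T \,\|\, P_S) = \sum_{y} P_T(y) \log \frac{P_T(y)}{P_S(y)}
\]

% Reverse KL: the expectation is taken under the student, so placing student
% mass where P_T is near zero is heavily penalized; the student retreats onto
% a single high-probability mode (zero-forcing, mode-seeking).
\[
D_{\mathrm{KL}}(P_S \,\|\, P_T) = \sum_{y} P_S(y) \log \frac{P_S(y)}{P_T(y)}
\]

Adaptive schemes of the kind named in the Figure 1 caption interpolate or switch between these two directions, typically on a per-token basis as a function of the teacher's confidence or entropy.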