Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, Dongbin Zhao

Abstract

On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.
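To make the proposed objective concrete, the sketch below shows one plausible PyTorch implementation of the truncated reverse-KL loss on the teacher's top-K local support with special-token masking. The function and argument names, the value of K, and the renormalization over the truncated support are illustrative assumptions, not the paper's exact recipe; top-p sampling of the student rollout would happen separately at generation time.

```python
import torch
import torch.nn.functional as F

def topk_truncated_reverse_kl(student_logits, teacher_logits, sampled_ids,
                              special_token_ids, k=32):
    """Truncated reverse KL over the teacher's top-K support, with
    special-token masking. Shapes: logits are [B, T, V], sampled_ids is
    [B, T]. All names and k are illustrative placeholders."""
    # 1. Teacher's top-K local support at each position of the student rollout.
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)               # [B, T, k]

    # 2. Renormalize both models over that truncated support.
    teacher_logp = F.log_softmax(topk_vals, dim=-1)
    student_logp = F.log_softmax(
        torch.gather(student_logits, -1, topk_idx), dim=-1)

    # 3. Reverse KL(student || teacher), restricted to the truncated support.
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)  # [B, T]

    # 4. Drop positions where the rollout emitted a special token
    #    (chat-template / control ids that may differ across tokenizers).
    special = torch.zeros_like(sampled_ids, dtype=torch.bool)
    for tok in special_token_ids:
        special |= sampled_ids.eq(tok)
    mask = (~special).float()

    return (kl * mask).sum() / mask.sum().clamp(min=1.0)
```

Compared with the sampled-token estimator, this loss aggregates feedback over the teacher's top-K candidates at every position rather than a single sampled token, which is the property the abstract credits for the more stable optimization.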

Paper Structure

This paper contains 43 sections, 29 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Effect of increasing $\gamma$ in the toy experiment. Larger $\gamma$ yields a higher and more persistent variance regime and, in the sequence-level limit, drifting policies in state space.
  • Figure 2: Scatter of token probabilities (student vs teacher). Sampled-token OPD at the first training iteration on Qwen2.5-7B-It [qwen2025qwen25], using OpenThinker3-7B [openthoughts2025] as the teacher model. The sampled-token signal is heavily skewed toward penalizing the current student token rather than providing a balanced reward.
  • Figure 3: The student falls into a repetition loop, yet the teacher model remains highly aligned with the student model on the repeating tokens, indicating a lack of proper penalty for such behavior.
  • Figure 4: Distribution of teacher-student log-probability gaps across token positions. Later positions show wider distributions and more extreme values, indicating a noisier teacher signal on long student-generated rollouts.
  • Figure 5: Token-level comparison can penalize semantically correct outputs due to tokenizer mismatch.
  • ...and 10 more figures