
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, Vladimir Braverman

Abstract

On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.
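To make the mechanism concrete: Figure 3 below describes the reverse-KL advantage as the gap between teacher and student log-probabilities on sampled tokens. The following is a minimal sketch, not the paper's implementation; it assumes the per-token advantage is exactly that log-probability gap, and that the reference-based divergence constraint enters as an RLHF-style per-token penalty against a frozen reference policy (`beta`, `stable_opd_advantage`, and the penalty form are illustrative assumptions).

```python
import torch

def reverse_kl_advantage(student_logp: torch.Tensor,
                         teacher_logp: torch.Tensor) -> torch.Tensor:
    # Per-token reverse-KL advantage: positive where the teacher assigns
    # higher probability to the sampled token than the student does.
    # Figure 3's "sudden jump in the reverse-KL advantage" corresponds to
    # the teacher term rising faster than the student term.
    return teacher_logp - student_logp

def stable_opd_advantage(student_logp: torch.Tensor,
                         teacher_logp: torch.Tensor,
                         ref_logp: torch.Tensor,
                         beta: float = 0.1) -> torch.Tensor:
    # Hypothetical stabilized variant (assumed, not from the paper):
    # subtract a reference-based divergence penalty so the student is
    # discouraged from drifting toward degenerate repetitive rollouts
    # that a frozen reference policy finds unlikely.
    return (teacher_logp - student_logp) - beta * (student_logp - ref_logp)

# Toy usage: token-level log-probs for one rollout of length 5. In
# practice `student` would carry gradients from the student model.
student = torch.tensor([-2.0, -1.5, -0.4, -0.3, -0.2])
teacher = torch.tensor([-1.8, -1.2, -0.1, -0.1, -0.1])
ref = torch.tensor([-2.1, -1.6, -1.5, -1.4, -1.3])
adv = stable_opd_advantage(student, teacher, ref)
loss = -(adv.detach() * student).mean()  # REINFORCE-style distillation loss
```

Under this reading, repetitive tokens that both models score highly (Figure 4) still earn a positive teacher-student gap, but the reference penalty grows as the student over-commits to them, which is one way such a constraint could damp repetition-driven length inflation.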


Paper Structure

This paper contains 27 sections, 7 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Abrupt length inflation within OPD.
  • Figure 2: Training dynamics of OPD on three groups. Training starts in a stable regime with low truncation and repetition, followed by a sharp phase transition where truncation and repetition increase and remain high while validation accuracy collapses, illustrating a robust truncation-repetition inflation failure mode of OPD.
  • Figure 3: Rollout-level evidence of abrupt repetition inflation for three student-teacher groups. Around the step where rollout length abruptly inflates, both student and teacher log-probabilities become much less negative, with the teacher's increase being larger, which induces a sudden jump in the reverse-KL advantage.
  • Figure 4: Reverse-KL advantage for regular and repetitive tokens during OPD training. Repetitive tokens receive larger advantages than regular tokens throughout training.
  • Figure 5: Training dynamics of OPD vs. StableOPD. Student: Qwen2.5-Math-1.5B; Teacher: OpenThinker3-7B.
  • ...and 2 more figures
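The abstract's second ingredient, rollout mixture distillation, is not specified in this excerpt. Below is a hedged sketch of one natural reading: each training batch mixes student rollouts with teacher rollouts at a fixed ratio, so truncated or repetitive student trajectories cannot dominate the data. All names (`mixed_rollout_batch`, `mix_ratio`, the sampler callables) are illustrative, not the paper's API.

```python
import random
from typing import Callable, List, Optional, Tuple

def mixed_rollout_batch(prompts: List[str],
                        sample_student: Callable[[str], str],
                        sample_teacher: Callable[[str], str],
                        mix_ratio: float = 0.5,
                        seed: Optional[int] = None) -> List[Tuple[str, str, str]]:
    # For each prompt, draw the rollout from the teacher with probability
    # `mix_ratio`, otherwise from the student. Teacher rollouts act as a
    # well-behaved anchor that dilutes the truncated, repetitive student
    # trajectories driving the collapse described in the abstract.
    rng = random.Random(seed)
    batch = []
    for prompt in prompts:
        if rng.random() < mix_ratio:
            batch.append((prompt, sample_teacher(prompt), "teacher"))
        else:
            batch.append((prompt, sample_student(prompt), "student"))
    return batch
```

The distillation loss would then apply to both sources: student-generated rollouts preserve the on-policy signal, while teacher rollouts behave like conventional off-policy distillation data.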