Scaling Reasoning Efficiently via Relaxed On-Policy Distillation

Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, Pashmina Cameron

Abstract

On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet it remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization in which the teacher-student log-likelihood ratio acts as a token-level reward. Building on this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation constraints of standard on-policy distillation. Specifically, REOPOLD leverages teacher rewards temperately and selectively through mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD surpasses its baselines across mathematical, visual, and agentic tool-use reasoning tasks, with superior sample efficiency during training and enhanced test-time scaling at inference. In particular, REOPOLD outperforms recent RL approaches while achieving 6.7x to 12x greater sample efficiency, and it enables a 7B student to match a 32B teacher in visual reasoning with an approximately 3.32x inference speedup.
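
The policy-optimization reading stated in the abstract can be sketched with the standard score-function identity for the reverse KL (RKL) objective. This is a generic derivation, not a reproduction of the paper's own remark or notation: the symbols $\pi_\theta$ (student), $\pi_T$ (teacher), $r$, and $\mathrm{sg}[\cdot]$ (stop-gradient) are assumed here for illustration.

```latex
\begin{align}
\mathcal{L}_{\mathrm{RKL}}(\theta)
  &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
     \left[ \log \frac{\pi_\theta(y \mid x)}{\pi_T(y \mid x)} \right], \\
\nabla_\theta \mathcal{L}_{\mathrm{RKL}}(\theta)
  &= \mathbb{E}_{y \sim \pi_\theta}
     \left[ \log \frac{\pi_\theta(y \mid x)}{\pi_T(y \mid x)}
            \, \nabla_\theta \log \pi_\theta(y \mid x) \right]
   + \underbrace{\mathbb{E}_{y \sim \pi_\theta}
     \left[ \nabla_\theta \log \pi_\theta(y \mid x) \right]}_{=\,0} \\
  &= -\,\mathbb{E}_{y \sim \pi_\theta}
     \left[ r(x, y) \, \nabla_\theta \log \pi_\theta(y \mid x) \right],
  \qquad
  r(x, y) = \mathrm{sg}\!\left[ \log \frac{\pi_T(y \mid x)}{\pi_\theta(y \mid x)} \right].
\end{align}
```

Under these assumptions, minimizing the reverse KL follows the same gradient as a REINFORCE-style update that maximizes the fixed reward $r$. Because both policies are autoregressive, the sequence-level log-ratio is a sum of per-token terms $r_t = \log \pi_T(y_t \mid x, y_{<t}) - \log \pi_\theta(y_t \mid x, y_{<t})$, which is the sense in which the log-likelihood ratio can serve as a token-level reward.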

Paper Structure (57 sections, 21 equations, 17 figures, 8 tables, 1 algorithm)

Figures (17)

  • Figure 1: Performance of Reopold. (a) Sample Efficiency: Reopold achieves a state-of-the-art trade-off between accuracy and sample efficiency on the AIME-25 benchmark. Detailed explanation can be found in Section \ref{sec:math}. (b) Test-time Scaling: On visual reasoning tasks, Reopold demonstrates superior test-time scaling capabilities compared to the vanilla RKL baseline. Notably, it allows smaller models to approach the performance of the 32B teacher. Detailed explanation can be found in Section \ref{sec:visual}.
  • Figure 2: Illustration of Reopold. While standard on-policy distillation (a) often introduces instability and inefficiency by forcing the student to mimic the teacher excessively, our proposed Reopold (b) fosters a more stable and effective learning environment. By establishing a formal connection between distillation and RL via a stop-gradient operation (Section \ref{sec:sg}), Reopold utilizes teacher signals temperately (Section \ref{sec:reward_clipping}) and selectively (Section \ref{sec:token-level-dynamic-sampling}). As depicted, this approach filters out potentially harmful signals, preventing the student from deviating excessively from its original distribution.
  • Figure 3: Comparison of training dynamics between vanilla RKL and RKL with stop-gradient. (a) Training loss dynamics exhibit similar trends, aligning with the theoretical equivalence in Remark \ref{remark:stop-grad}. (b) The gradient norm is markedly lower and more stable when stop-gradient is applied, which (c) leads to superior validation performance. This confirms that treating the log-likelihood ratio as a fixed reward signal is beneficial for optimization stability.
  • Figure 4: Log-scale histogram of token-level log-likelihood ratio rewards. The red dashed ellipse indicates degenerate near-zero rewards, while the purple dashed region highlights the heavy tail of negative rewards. These distributions pose challenges for RL-based on-policy distillation by causing gradient vanishing and training instability, respectively.
  • Figure 5: Correlation between token entropy and log-likelihood ratio rewards. Experimental results on math reasoning and visual reasoning benchmarks demonstrate that rewards for tokens below the 60th entropy percentile are heavily concentrated around zero. This suggests that while teacher and student policies may diverge overall, they remain highly consistent on low-entropy, deterministic tokens, with significant deviations occurring primarily in high-entropy regimes.
  • ...and 12 more figures
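
The observations in the captions for Figures 4 and 5 (near-zero rewards on low-entropy tokens, larger deviations on high-entropy ones) can be illustrated with a small sketch. It computes per-token log-likelihood ratio rewards and a mask that keeps only the higher-entropy tokens. The function names, the toy probabilities, and the 60% cutoff (borrowed from the Figure 5 caption) are assumptions for illustration only; this is not REOPOLD's actual sampling rule, and it omits mixture-based reward clipping entirely.

```python
import math

def token_rewards(student_probs, teacher_probs, sampled_ids):
    """r_t = log pi_T(y_t) - log pi_S(y_t) for each sampled token y_t."""
    return [
        math.log(t[y]) - math.log(s[y])
        for s, t, y in zip(student_probs, teacher_probs, sampled_ids)
    ]

def token_entropies(student_probs):
    """Shannon entropy (in nats) of the student distribution at each position."""
    return [-sum(p * math.log(p) for p in dist if p > 0.0) for dist in student_probs]

def high_entropy_mask(entropies, drop_percent=60):
    """Return True for tokens that are NOT among the lowest-entropy drop_percent."""
    n = len(entropies)
    n_drop = (n * drop_percent) // 100  # integer arithmetic avoids float rounding
    order = sorted(range(n), key=lambda i: entropies[i])
    dropped = set(order[:n_drop])
    return [i not in dropped for i in range(n)]

# Toy example: 5 positions, vocabulary of 3 tokens (all values hypothetical).
student_probs = [
    [0.98, 0.01, 0.01],
    [0.50, 0.30, 0.20],
    [0.99, 0.005, 0.005],
    [0.40, 0.40, 0.20],
    [0.97, 0.02, 0.01],
]
teacher_probs = [
    [0.97, 0.02, 0.01],
    [0.10, 0.80, 0.10],
    [0.99, 0.005, 0.005],
    [0.70, 0.20, 0.10],
    [0.96, 0.03, 0.01],
]
sampled_ids = [0, 0, 0, 1, 0]  # tokens sampled from the student (on-policy)

rewards = token_rewards(student_probs, teacher_probs, sampled_ids)
entropies = token_entropies(student_probs)
mask = high_entropy_mask(entropies)

# Rewards treated as constants (stop-gradient) and restricted to kept tokens.
kept_rewards = [r for r, keep in zip(rewards, mask) if keep]
```

In this toy data the three confident positions carry rewards close to zero and are masked out, while the two uncertain positions carry the large negative rewards. In a training loop, the kept rewards would weight the student's log-probabilities in a policy-gradient surrogate; that step needs an autodiff framework and is left out here.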

Theorems & Definitions (1)

  • Remark 3.1