Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang; Weijie Liu; Ruobing Xie; Kai Yang; Saiyong Yang; Yankai Lin

TL;DR

This work reframes on-policy distillation (OPD) as a special case of dense KL-constrained RL and generalizes it to Generalized On-Policy Distillation (G-OPD) by introducing a reward-scaling factor $\lambda$ and a flexible reference model $\pi_{\mathrm{ref}}$. The key idea is to control the balance between reward and regularization, enabling reward interpolation ($0<\lambda<1$) and reward extrapolation ($\lambda>1$); in particular, reward extrapolation (ExOPD) can push a student beyond the teacher’s capabilities, and reward-corrected references can further boost strong-to-weak distillation. Experiments on math reasoning and code generation show ExOPD consistently outperforms standard OPD and off-policy baselines, with multi-teacher distillation yielding a unified student that surpasses domain teachers. Reward correction in strong-to-weak distillation further improves accuracy, albeit with higher computational cost. Overall, the G-OPD framework offers a principled, scalable path to surpass teacher performance through controlled reward weighting and flexible reference modeling.
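The relationship between OPD and G-OPD described above can be written out as a short derivation. This is a sketch reconstructed from the summary, not the paper's own equations: the teacher symbol $\pi_T$, the token-level conditioning notation, and the exact form of the objective are assumptions.

```latex
% Assumed token-level G-OPD objective: a teacher/reference log-ratio reward
% scaled by \lambda, minus a KL-style penalty toward \pi_{ref}.
\mathcal{J}_{\lambda}(\theta)
  = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t}
      \lambda \log \frac{\pi_T(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}
      - \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}
    \right]

% With \lambda = 1 the \pi_{ref} terms cancel, leaving the negative reverse KL
% to the teacher on student-generated trajectories, i.e. standard OPD:
\mathcal{J}_{1}(\theta)
  = - \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t}
      \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_T(y_t \mid x, y_{<t})}
    \right]
```

Under this reading, the reference model only matters when $\lambda \neq 1$, which is why standard OPD can be stated without one.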

Abstract

On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre-RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.
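The effect of the reward scaling factor can be illustrated on a toy next-token distribution. The sketch below is not the paper's implementation: it assumes a single-step objective of the form $\lambda\,\mathbb{E}[\log(\pi_T/\pi_{\mathrm{ref}})] - \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$, whose maximizer is proportional to $\pi_{\mathrm{ref}}^{1-\lambda}\pi_T^{\lambda}$. The function names and all probabilities are hypothetical.

```python
import math


def gopd_objective(pi, teacher, ref, lam):
    """Assumed single-step G-OPD objective:
    lam * E_pi[log(teacher/ref)] - KL(pi || ref)."""
    return sum(
        p * (lam * math.log(t / r) - math.log(p / r))
        for p, t, r in zip(pi, teacher, ref)
        if p > 0.0
    )


def gopd_optimum(teacher, ref, lam):
    """Closed-form maximizer of the objective above:
    pi*(y) is proportional to ref(y)^(1 - lam) * teacher(y)^lam."""
    unnorm = [r ** (1.0 - lam) * t ** lam for t, r in zip(teacher, ref)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Made-up distributions over a 3-token vocabulary.
ref = [0.5, 0.3, 0.2]      # reference model, e.g. a pre-RL base model
teacher = [0.7, 0.2, 0.1]  # teacher: shifted probability mass toward token 0

interp = gopd_optimum(teacher, ref, lam=0.5)   # reward interpolation
opd = gopd_optimum(teacher, ref, lam=1.0)      # standard OPD target
exopd = gopd_optimum(teacher, ref, lam=1.25)   # reward extrapolation

for name, dist in [("interp", interp), ("opd", opd), ("exopd", exopd)]:
    print(name, [round(p, 4) for p in dist])
```

In this toy setting, $\lambda = 1$ recovers the teacher exactly, $0 < \lambda < 1$ lands between the reference and the teacher, and $\lambda > 1$ moves further along the direction in which the teacher departed from the reference. Whether that extra movement helps on real tasks is an empirical question that the paper's experiments address.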

Paper Structure (20 sections, 24 equations, 6 figures, 7 tables)

Introduction
Related Work
Methodology
Preliminaries
Generalized On-Policy Distillation
Reward interpolation and extrapolation in G-OPD.
Reward correction in strong-to-weak distillation.
Experiments and Analysis
Experiments with Same-Sized Student and Teacher
Experimental Settings
Results of Single-Teacher Distillation
Results of Multi-Teacher Distillation
Experiments in the Strong-to-Weak Distillation Setting
Experimental Settings
Results of Strong-to-Weak Distillation
...and 5 more sections

Figures (6)

Figure 1: The empirical effectiveness of our method ExOPD compared with off-policy distillation (SFT), standard OPD, and the weight-extrapolation method ExPO in multi-teacher and strong-to-weak distillation settings (results averaged over 4 math reasoning and 3 code generation benchmarks). (a) When merging multiple domain experts (obtained by applying domain-specific RL to the same base model) back into the original base model, ExOPD is the only method that yields a unified student that consistently outperforms all domain teachers. (b) ExOPD also yields significant improvements over standard OPD when distilling a smaller student from a larger teacher. Moreover, applying reward correction in ExOPD can further boost distillation performance (see the figure on the effect of reward correction).
Figure 2: On-policy distillation results on four math reasoning benchmarks under different choices of reward scaling factor $\lambda$.
Figure 3: On-policy distillation results on three code generation benchmarks under different choices of reward scaling factor $\lambda$.
Figure 4: Trends in the average number of tokens and the average accuracy of the on-policy distilled models across different benchmarks under varying reward scaling factors. The teacher for math reasoning tasks is Qwen3-4B-Non-thinking-GRPO-Math, while the teacher for code generation tasks is Qwen3-4B-Non-thinking-GRPO-Code.
Figure 5: Training dynamics of OPD and ExOPD in multi-teacher distillation experiments. We visualize using Exponential Moving Average (EMA) smoothing with a coefficient of 0.5.
...and 1 more figures
