Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
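The routing rule described above is simple to state. Below is a minimal sketch of how one on-policy group might be split, assuming a `Rollout` record with a binary verifiable reward and a flag for teacher availability (the names and fields are our illustration, not the paper's released code):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Rollout:
    tokens: list          # sampled response tokens
    reward: float         # verifiable reward: 1.0 if the answer checks out, else 0.0
    has_teacher: bool     # feedback-conditioned teacher logits available?

def route_group(rollouts: List[Rollout]) -> Tuple[List[Rollout], List[Rollout]]:
    """Split one on-policy group by correctness, as SRPO's router does:
    correct rollouts feed GRPO's reward-aligned update, failed rollouts
    with teacher information feed SDPO's logit-level correction."""
    grpo_branch = [r for r in rollouts if r.reward == 1.0]
    sdpo_branch = [r for r in rollouts if r.reward < 1.0 and r.has_teacher]
    return grpo_branch, sdpo_branch
```

The abstract does not specify how failed rollouts lacking usable teacher information are handled, so this sketch simply drops them.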

Figures (5)

  • Figure 1: Training dynamics and diagnostic analysis on Chemistry with Qwen3-8B. (a) SDPO improves faster than GRPO in early training, but is later overtaken and collapses, whereas SRPO achieves both rapid initial improvement and stable long-horizon optimization. (b) Restricting SDPO updates to incorrect samples retains most of its overall benefit, whereas applying SDPO only to correct samples degrades performance and destabilizes training, supporting the necessity of sample routing. (c) The self-teacher's token-level entropy rises during training, indicating that the distillation signal becomes increasingly dominated by uncertain predictions. Curves show a 5-step rolling mean and shaded bands denote $\pm 1$ std.
  • Figure 2: Overview of SRPO. Given a prompt $x$, the policy $\pi_\theta$ generates a group of on-policy rollouts. A correctness check routes each rollout to one of two branches: correct samples are sent to the GRPO branch (top), where group-relative advantages provide a reward-aligned policy update; incorrect samples with available teacher information are sent to the SDPO branch (bottom), where a feedback-conditioned self-teacher produces logit-level distillation targets via $\mathrm{KL}(P\;\|\;\mathrm{stopgrad}(Q))$ for dense corrective supervision. (Code sketches of both branch objectives follow this list.)
  • Figure 3: Training curves on three representative benchmarks for Qwen3-8B. We plot avg@16 against wall-clock training time on (a) Chemistry, (b) Biology, and (c) Tool Use. These curves complement Table \ref{tab:main-results}, which reports the highest achieved result within each training budget. All curves show a 5-step rolling mean and shaded bands denote $\pm 1$ std.
  • Figure 4: Response length and per-step compute time for Qwen3-8B. (a) Response length on Chemistry: GRPO remains consistently long, SDPO drops rapidly, and SRPO stays moderate. Curves show a 5-step rolling mean and shaded bands denote $\pm 1$ std. (b) Average seconds per training step, averaged across five benchmarks and measured over the 1h, 5h, and 10h windows. SRPO incurs a modest overhead relative to GRPO in the early stage of training, but becomes faster than both GRPO and SDPO over longer training horizons.
  • Figure 5: Routing statistics during SRPO training of Qwen3-8B on Chemistry. (a) Fraction of samples routed to the GRPO branch. (b) Fraction of samples routed to the SDPO branch. (c) Fraction of samples in each batch for which teacher information can be constructed. As training progresses, the policy improves and generates more correct rollouts, causing the SDPO fraction to decrease steadily while the GRPO fraction increases correspondingly. All curves show a 5-step rolling mean and shaded bands denote $\pm 1$ std.
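For the GRPO branch (top of Figure 2), the group-relative advantage is the standard GRPO formulation: each rollout's reward is standardized against its group's statistics, so no learned critic is needed. A minimal sketch:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standard GRPO advantage for one group of G rollouts sharing a prompt.
    Standardizes each reward against the group mean and std; assumes G > 1.
    Shape: [G] -> [G]."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

Computed over the full group, correct rollouts receive positive advantages and failed ones negative; under the routing above, only the correct ones would then contribute to the policy-gradient update, though the paper's exact bookkeeping may differ from this sketch.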
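To make the SDPO branch (bottom of Figure 2) concrete, here is a hedged sketch of the per-token loss $\mathrm{KL}(P\;\|\;\mathrm{stopgrad}(Q))$ combined with an entropy-aware weight. The specific weighting form (one minus normalized teacher entropy) is our assumption, since the abstract only states that high-entropy targets are suppressed and confident ones emphasized:

```python
import math
import torch
import torch.nn.functional as F

def entropy_weighted_distill_loss(student_logits: torch.Tensor,
                                  teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(P || stopgrad(Q)) per token, down-weighted by teacher entropy.

    student_logits, teacher_logits: [num_tokens, vocab_size].
    Only the student receives gradients; the teacher is detached.
    """
    log_p = F.log_softmax(student_logits, dim=-1)
    log_q = F.log_softmax(teacher_logits.detach(), dim=-1)  # stopgrad(Q)

    # Forward KL per token: sum_v P(v) * (log P(v) - log Q(v))
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)

    # Teacher entropy per token, normalized to [0, 1] by log|V|
    q = log_q.exp()
    entropy = -(q * log_q).sum(dim=-1) / math.log(teacher_logits.size(-1))

    # Assumed weighting: confident (low-entropy) targets count more
    weight = (1.0 - entropy).clamp(min=0.0)
    return (weight * kl).mean()
```

This matches the caption's orientation of the divergence, with $P$ the student policy and $Q$ the detached self-teacher, so gradients flow only through the student's log-probabilities.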