Self-Distilled RLVR

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan

Abstract

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which obtains only sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information, such as reference answers, to enable self-evolution. This paper demonstrates that learning signals derived solely from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences that determine fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
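
To make the division of labor concrete, below is a minimal PyTorch sketch of such a loss. It is illustrative only: the function name `rlsd_loss`, the absolute log-ratio weighting, and the GRPO-style group-mean baseline are our assumptions, not the paper's released implementation.

```python
# Minimal sketch of an RLSD-style loss (our illustration, not the authors'
# code). Assumed setup: `teacher_logits` come from the SAME model, but
# conditioned on privileged information (e.g., the reference answer
# prepended to the prompt); `rewards` are verifiable outcome rewards.
import torch
import torch.nn.functional as F

def rlsd_loss(student_logits, teacher_logits, token_ids, rewards, group_mean):
    """
    student_logits, teacher_logits: [batch, seq_len, vocab]
    token_ids:  [batch, seq_len]  tokens of the sampled trajectories
    rewards:    [batch]           verifiable rewards (e.g., 1.0 if correct)
    group_mean: [batch]           GRPO-style per-group baseline
    """
    logp_s = F.log_softmax(student_logits, dim=-1)
    logp_t = F.log_softmax(teacher_logits, dim=-1)

    # Log-probabilities of the sampled tokens under each policy.
    idx = token_ids.unsqueeze(-1)
    lp_s = logp_s.gather(-1, idx).squeeze(-1)   # [batch, seq_len]
    lp_t = logp_t.gather(-1, idx).squeeze(-1)   # [batch, seq_len]

    # Self-distillation supplies only the token-level update MAGNITUDE:
    # tokens where the privileged teacher and the student disagree most
    # get the largest weight. (One plausible weighting; details may differ.)
    with torch.no_grad():
        w = (lp_t - lp_s).abs()
        w = w / (w.mean(dim=-1, keepdim=True) + 1e-8)

    # RLVR supplies the update DIRECTION via the verifiable reward, so the
    # teacher's privileged knowledge never fixes the sign of the gradient.
    adv = (rewards - group_mean).unsqueeze(-1)  # [batch, 1]

    # REINFORCE-style surrogate with distillation-derived token weights.
    return -(w * adv * lp_s).mean()
```

Under this split, a wrong update direction can never come from the teacher: the verifiable reward fixes the sign, and self-distillation only decides how far each token moves.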

Paper Structure

This paper contains 43 sections, 8 theorems, 33 equations, 6 figures, 3 tables, and 1 algorithm.

Key Result

Theorem 1

The OPSD objective and the ideal objective satisfy the identity
$$\mathcal{L}_{\mathrm{OPSD}} \;=\; \mathcal{L}_{\mathrm{ideal}} \;+\; \sum_{t} I(Y_t; R \mid X, Y_{<t}),$$
where $I(Y_t; R \mid X, Y_{<t})$ denotes the conditional mutual information between the current token $Y_t$ and the privileged information $R$ under the teacher distribution.
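
For intuition, the leakage term is exactly an expected per-token KL divergence. A short derivation sketch in our own notation (the paper's proof may differ): write $p$ for the privileged teacher, $q$ for the student, and $\bar{p}(Y_t \mid X, Y_{<t}) = \mathbb{E}_{R}\big[p(Y_t \mid X, Y_{<t}, R)\big]$ for the $R$-marginalized teacher. Then

$$\mathbb{E}_{R}\Big[\mathrm{KL}\big(p(\,\cdot \mid X, Y_{<t}, R)\,\big\|\,q(\,\cdot \mid X, Y_{<t})\big)\Big] \;=\; \mathrm{KL}\big(\bar{p}(\,\cdot \mid X, Y_{<t})\,\big\|\,q(\,\cdot \mid X, Y_{<t})\big) \;+\; I(Y_t; R \mid X, Y_{<t}),$$

so distilling the student toward the privileged teacher always pays an extra, nonnegative mutual-information cost, which vanishes only when the reference answer $R$ carries no information about the next token.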

Figures (6)

  • Figure 1: Performance of the trained Qwen3-VL-8B-Instruct model. In (a), OPSD reaches its peak performance early and then degrades, whereas RLSD inherits the training stability of GRPO while achieving a higher convergence ceiling. In (b), GRPO and RLSD report results at 200 training steps, while GRPO (2$\times$ steps) reports results at 400 steps; RLSD at 200 steps already surpasses GRPO trained for twice as many steps, demonstrating faster convergence.
  • Figure 2: A representative example illustrating the privileged information leakage exhibited by the OPSD-trained model, where the model appeals to an invisible reference solution during inference.
  • Figure 3: Leakage, KL divergence, and validation performance of OPSD and its ablated variants.
  • Figure 4: An overview of our RLSD method.
  • Figure 5: Training dynamics on the multimodal reasoning tasks.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Theorem 1: KL Decomposition
  • Proposition 1: Per-Sample Gradient Decomposition
  • Proposition 2: Capacity Ceiling
  • Theorem 2: Training Instability Under Online Teacher
  • Proposition 3: Self-Reinforcing Feedback Loop
  • Theorem 3: Impossibility Trilemma
  • Theorem 4: RLSD Weights as Belief Update Ratios
  • Theorem 5: Leakage-Free Training Under RLSD