RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization

Linxuan Xia; Xiaolong Yang; Yongyuan Chen; Enyue Zhao; Deng Cai; Yasheng Wang; Boxi Wu

TL;DR

This work addresses the challenge of aligning large language models with domain knowledge while preserving their general reasoning capabilities. It introduces Rephrasing Policy Optimization (RePO), a two-stage framework: the policy model first internalizes off-policy knowledge by rephrasing it in its own style, and the resulting high-quality traces are then injected dynamically into on-policy training, gated by the group reward. Because the model rephrases offline guidance rather than imitating it directly, RePO keeps updates stable and improves learning on hard samples. It outperforms existing on-policy and off-policy baselines on math, general-knowledge, and financial-domain benchmarks, achieves state-of-the-art performance, and transfers robustly across multiple data sources and task families, suggesting a principled way to fuse heterogeneous knowledge sources in RL for language models.

Abstract

Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning level. Recent off-policy RL attempts improve hard sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
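The replacement step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `inject_rephrased_trace`, the `failure_threshold` and `success_reward` parameters, and the choice to overwrite the single lowest-reward rollout are assumptions made for illustration, and the rephrased trace is assumed to have been generated and scored elsewhere.

```python
from typing import List, Tuple


def inject_rephrased_trace(
    rollouts: List[str],
    rewards: List[float],
    rephrased_trace: str,
    rephrased_reward: float,
    failure_threshold: float = 1.0,  # assumption: failed fraction that opens the gate
    success_reward: float = 1.0,     # assumption: reward of a verified-correct answer
) -> Tuple[List[str], List[float], bool]:
    """Replace the lowest-reward rollout with a rephrased trace when the group mostly fails.

    Returns the (possibly modified) rollouts, their rewards, and a flag
    saying whether the injection happened.
    """
    if not rollouts or len(rollouts) != len(rewards):
        raise ValueError("rollouts and rewards must be non-empty and equally long")

    # Fraction of rollouts in the group that did not reach the success reward.
    failures = sum(1 for r in rewards if r < success_reward)
    failure_rate = failures / len(rewards)

    # Gate: inject only when the group fails often enough and the
    # rephrased trace itself is correct; otherwise keep the group as sampled.
    if failure_rate < failure_threshold or rephrased_reward < success_reward:
        return list(rollouts), list(rewards), False

    # Overwrite the worst rollout (first one on ties) with the guided trace.
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    new_rollouts, new_rewards = list(rollouts), list(rewards)
    new_rollouts[worst] = rephrased_trace
    new_rewards[worst] = rephrased_reward
    return new_rollouts, new_rewards, True


if __name__ == "__main__":
    group = ["try 1", "try 2", "try 3", "try 4"]
    print(inject_rephrased_trace(group, [0.0, 0.0, 0.0, 0.0], "guided trace", 1.0))
```

With the default threshold of 1.0, the gate opens only when every rollout in the group fails; a lower value would inject guidance more often, at the cost of replacing more of the model's own samples.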

Paper Structure (33 sections, 11 equations, 4 figures, 5 tables)

Introduction
Preliminaries
Reinforcement Learning with Verifiable Rewards (RLVR)
Group Relative Policy Optimization (GRPO)
Method
Rephrasing Policy Optimization (RePO)
Joint Probability Trajectory Sampling based on Off-policy Knowledge.
Dynamic Guidance Strategy based on Group Reward Distribution.
Training Stability Analysis
Qualitative Comparison of Policy Optimization Methods
Quantitative Experimental Indicators
GRPO: Stable but Low Utilization
LUFFY: Unstable due to Vocabulary Mismatch
RePO: Stable and Correct Updates
Experimental Results
...and 18 more sections

Figures (4)

Figure 1: Challenges in bridging the gap between on-policy exploration and off-policy expertise. Aligning models with domain knowledge while maintaining general reasoning remains a fundamental challenge. SFT forces the model to fit "alien" distributions, harming its generality. Conversely, Pure RL struggles to reach unfamiliar knowledge due to the lack of guidance. Furthermore, a Naive Hybrid approach that simply mixes off-policy data results in optimization conflicts and unstable training dynamics.
Figure 2: Overview of the Rephrasing Policy Optimization (RePO) framework. The pipeline consists of three key phases: (1) Knowledge Internalization: A rephrasing prompt guides the policy model to comprehend external knowledge and rewrite it into its native stylistic distribution, converting off-policy data into on-policy-compatible traces. (2) Dynamic Injection: To minimize instability caused by distribution shifts, the rephrased trace selectively replaces a low-quality rollout only when the group exhibits a high failure rate. (3) Optimization: The final rollout group, potentially containing the guided trace, is updated via the standard GRPO process.
Figure 3: GRPO vs. RePO on Tencent FinLLM Eval (finLLM-Eval): stable entropy, stable GradNorm, and rewards with different growth rates. For reasoning tasks (reasoning and calculation, middle row), RePO shows a slight advantage over GRPO, while for knowledge-based tasks (fact and principle, bottom row), RePO's knowledge injection yields significant improvements.
Figure 4: LUFFY vs. RePO on the OpenR1-Math Hard Subset: unstable entropy, exploding GradNorm, and vanishing reward. Overly difficult samples and vocabulary-mismatched rollouts during training lead to very large, aggressive parameter updates, causing gradient explosion and model collapse.
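Phase (3) in Figure 2 hands the final rollout group to the standard GRPO update. The sketch below shows the commonly used group-relative advantage, in which each reward is standardized against its own group, and why a group where every rollout fails contributes no learning signal until a correct trace is present. The paper's own equations are not reproduced in this section, so the use of the population standard deviation and the `eps` stabilizer are assumptions here, and `group_relative_advantages` is a hypothetical helper name.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against the mean and spread of its rollout group.

    Assumptions: population standard deviation and an additive `eps` in the
    denominator; implementations differ on both details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Every rollout fails: all advantages are zero, so the group gives no gradient.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))

    # One correct trace in the group: it gets a positive advantage,
    # and the failed rollouts get negative ones.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

Under these assumptions, replacing a single failed rollout with a correct rephrased trace is enough to turn a zero-signal group into one with a usable gradient, which matches the motivation given for dynamic injection on hard samples.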
