Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, Jianye Hao

TL;DR

This paper tackles the sample-inefficiency of on-policy reinforcement finetuning for large language models by introducing ReMix, a general off-policy enhancement for on-policy RL methods like PPO and GRPO. It integrates Mix-policy proximal policy gradient (with increased UTD), a KL-convex policy constraint, and a policy reincarnation mechanism to achieve fast early gains without sacrificing long-term improvement. Empirical results on five math-reasoning benchmarks show ReMix attaining state-of-the-art accuracy at 1.5B and 7B scales while dramatically reducing rollout data volume (by 30x to 450x) compared to baselines. The work also provides rich analyses of how off-policy data shapes reasoning behavior, including impacts on response length and self-reflection, offering practical guidance for efficient RFT of LLMs.

Abstract

Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck on continued economical and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach that enables on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) Policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models on top of PPO and GRPO with 1.5B and 7B base models. ReMix achieves an average Pass@1 accuracy of 52.10% (for the 1.5B model) with 0.079M response rollouts and 350 training steps, and 63.27%/64.39% (for the 7B model) with 0.007M/0.011M response rollouts and 50/75 training steps, on five math reasoning benchmarks (i.e., AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with a reduction of over 30x to 450x in training cost in terms of rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, such as the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy and the collapse mode of self-reflection behavior in the presence of severe off-policyness.
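
To make the first two components more concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that Mix-policy proximal policy gradient reuses a PPO-style clipped surrogate in which the importance sampling ratio is taken against whichever policy generated each sample, and that the KL-Convex constraint is a convex combination of two KL terms (one to the base model, one to the preceding policy). The function names, the `lam` weight, and the toy values are all hypothetical. Under this reading, raising the UTD ratio would simply mean running more such updates per batch of fresh rollouts.

```python
import numpy as np


def clipped_surrogate(logp_new, logp_behavior, advantages, clip_eps=0.2):
    """PPO-style clipped objective (to be maximized), averaged over samples.

    logp_behavior holds log-probabilities under whichever policy generated
    each sample: the current policy for on-policy data, an older policy
    for off-policy data.
    """
    ratio = np.exp(logp_new - logp_behavior)  # importance sampling ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))


def kl_convex_penalty(kl_to_base, kl_to_prev, lam):
    """ASSUMED form: convex combination of two KL terms, with 0 <= lam <= 1."""
    return lam * kl_to_base + (1.0 - lam) * kl_to_prev


# Toy values for a two-sample mixed batch (hypothetical numbers).
logp_new = np.log(np.array([0.5, 0.2]))
logp_behavior = np.log(np.array([0.25, 0.4]))
advantages = np.array([1.0, -1.0])

objective = clipped_surrogate(logp_new, logp_behavior, advantages)
penalty = kl_convex_penalty(kl_to_base=1.0, kl_to_prev=3.0, lam=0.25)
```

The larger the gap between `logp_new` and `logp_behavior`, the more often the ratio hits the clip range, which is one way to picture the policy distribution shift that grows with the proportion of off-policy data.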

Paper Structure

This paper contains 39 sections, 10 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Efficiency-Performance Comparison for 1.5B Models (left) and 7B Models (right) in terms of Rollout Data Volume (i.e., total number of responses generated during training) vs. Average Pass@1 Accuracy on five math reasoning benchmarks. An ideal model should appear in the top-left corner. Our method ReMix shows superior scores and significantly better training efficiency compared with standard PPO and GRPO. Moreover, ReMix-PPO achieves SOTA-level performance at 1.5B (52.10, 0.079M) and 7B scale (63.27/64.39, 0.007M/0.011M) with a reduction of over 30x to 450x in rollout data volume compared with DeepScaleR (52.14, 2.519M) and AceReason-Nemotron (63.24, 3.584M). The polylines denote the training process, with the training step numbers in round brackets.
  • Figure 2: The conceptual illustration of RFT for LLMs with different proximal policy gradient (PPG) methods (denoted by different colors). Starting from a base model, (1) the prevalent on-policy PPG methods (e.g., PPO, GRPO) yield a stable and effective training process, yet exhibit inefficient data utilization (i.e., the orange waved curve). (2) Off-policy PPG offers appealing potential in data efficiency. However, naively adopting off-policy PPG leads to a training collapse (i.e., the less waved green curve). (3) To strike a balance, we introduce Mix-PPG, which manages to boost early-stage performance but still faces a slow asymptotic improvement (denoted by the cyan curve) and even a collapse when adopting a high UTD ratio (i.e., the straight dark green curve). (4) To this end, we propose policy reincarnation and introduce ReMix. ReMix seamlessly takes advantage of both the efficient early-stage training of Mix-PPG and the stable asymptotic improvement of on-policy PPG (i.e., the fusion of the cyan and red curves), thereby achieving significantly better efficiency at almost no compromise of final performance.
  • Figure 3: Training Efficiency Comparison for ReMix-PPO and PPO (1.5B) on MATH and Olympiad. We evaluate training efficiency across three dimensions: rollout data volume, training steps, and training duration. ReMix reaches a score above 40% around 3x to 6x faster than PPO.
  • Figure 4: Training Dynamics regarding Importance Sampling Ratio, Accuracy, and Response Length under Varying Proportions of Off-policy Data $p$ for Mix-PPG. Leveraging more off-policy data leads to a larger policy distribution shift, a faster early boost in accuracy yet worse later-stage performance, and a shorter response length.
  • Figure 5: Training Dynamics regarding Accuracy, Response Length, and Self-reflection Rate for On-policy vs. Off-policy Training. Mix-PPG significantly increases the early-stage training efficiency, while showing a rapid decrease in response length and self-reflection rate. An increased UTD ratio further enhances the efficiency, but results in a severe degradation in accuracy. Thanks to policy reincarnation, ReMix shows a merged learning behavior that combines the superior early efficiency with the asymptotic improvement.
  • ...and 4 more figures
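
As a quick sanity check on the reduction factors quoted in the Figure 1 caption, the ratios can be recomputed from the rollout volumes given there. This is only a sketch: the dictionary keys are hypothetical labels, and the volumes are the rounded values (in millions of responses) from the caption, so the resulting ratios are approximate. The 7B ratios in particular are sensitive to rounding at the third decimal.

```python
# Rollout data volumes in millions of responses, copied from the Figure 1 caption.
# The keys are hypothetical labels chosen for this example.
ROLLOUTS_M = {
    "DeepScaleR (1.5B)": 2.519,
    "ReMix-PPO (1.5B, 350 steps)": 0.079,
    "AceReason-Nemotron (7B)": 3.584,
    "ReMix-PPO (7B, 50 steps)": 0.007,
    "ReMix-PPO (7B, 75 steps)": 0.011,
}


def reduction_factor(baseline: str, ours: str) -> float:
    """How many times more rollouts the baseline used than our model."""
    return ROLLOUTS_M[baseline] / ROLLOUTS_M[ours]


ratio_1p5b = reduction_factor("DeepScaleR (1.5B)", "ReMix-PPO (1.5B, 350 steps)")
ratio_7b_50 = reduction_factor("AceReason-Nemotron (7B)", "ReMix-PPO (7B, 50 steps)")
ratio_7b_75 = reduction_factor("AceReason-Nemotron (7B)", "ReMix-PPO (7B, 75 steps)")

for name, value in [("1.5B", ratio_1p5b), ("7B @ 50 steps", ratio_7b_50),
                    ("7B @ 75 steps", ratio_7b_75)]:
    print(f"{name}: about {value:.0f}x fewer rollouts")
```

With these rounded inputs, the 1.5B ratio lands just above 30x and the 7B ratios land in the hundreds, which matches the order of magnitude of the stated "over 30x to 450x" range.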