Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

Jing Liang; Hongyao Tang; Yi Ma; Jinyi Liu; Yan Zheng; Shuyue Hu; Lei Bai; Jianye Hao

TL;DR

This paper tackles the sample inefficiency of on-policy reinforcement finetuning (RFT) for large language models by introducing ReMix, a general off-policy enhancement for on-policy RL methods like PPO and GRPO. It combines a Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio, a KL-Convex policy constraint, and a policy reincarnation mechanism to achieve fast early gains without sacrificing long-term improvement. Empirical results on five math-reasoning benchmarks show ReMix reaching state-of-the-art-level accuracy at the 1.5B and 7B scales while reducing rollout data volume by 30x to 450x compared to baselines. The work also analyzes how off-policy data shapes reasoning behavior, including its impact on response length and self-reflection, offering practical guidance for efficient RFT of LLMs.

Abstract

Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost in compute and time, posing a stringent bottleneck on continued economical and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach that enables on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) a KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models on top of PPO and GRPO with 1.5B and 7B base models. ReMix achieves an average Pass@1 accuracy of 52.10% (for the 1.5B model) with 0.079M response rollouts and 350 training steps, and 63.27%/64.39% (for the 7B model) with 0.007M/0.011M response rollouts and 50/75 training steps, on five math reasoning benchmarks (i.e., AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with a reduction of over 30x to 450x in training cost, measured by rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, including the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy and the collapse mode of self-reflection behavior in the presence of severe off-policyness.
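
To make the first two components more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the proximal policy gradient uses the standard PPO clipped surrogate, with the importance ratio taken against whichever policy generated each sample, and that the KL-Convex constraint is a convex combination of two KL terms (one to the base model, one to the preceding policy). The function names, the weight `lam`, and the default `clip_eps=0.2` are hypothetical choices for illustration; the abstract does not specify them.

```python
import math


def clipped_surrogate(logp_new, logp_behavior, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, averaged over samples.

    logp_behavior holds the log-probability under the policy that generated
    each sample, so the same formula covers fresh (on-policy) samples and
    reused (off-policy) ones. Assumed form, not taken from the paper.
    """
    total = 0.0
    for lp_new, lp_beh, adv in zip(logp_new, logp_behavior, advantages):
        ratio = math.exp(lp_new - lp_beh)  # importance sampling ratio
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)  # pessimistic bound
    return total / len(advantages)


def kl_convex_penalty(kl_to_base, kl_to_prev, lam):
    """Convex combination of two KL terms (hypothetical form).

    lam = 1 anchors the policy only to the base model; lam = 0 anchors it
    only to the preceding policy.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * kl_to_base + (1.0 - lam) * kl_to_prev


if __name__ == "__main__":
    # One on-policy sample (ratio = 1) and one stale sample (ratio = e).
    print(clipped_surrogate([0.0, 1.0], [0.0, 0.0], [1.0, 1.0]))
    print(kl_convex_penalty(2.0, 4.0, lam=0.25))
```

In this sketch, a larger discrepancy between the current policy and the behavior policy pushes the ratio outside the clip range, which caps how much a stale sample with a positive advantage can contribute. A higher UTD ratio would correspond to calling such an update more times per batch of collected rollouts.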
