LLMs Can Learn to Reason Via Off-Policy RL

Daniel Ritter; Owen Oertell; Bradley Guo; Jonathan Chang; Kianté Brantley; Wen Sun

TL;DR

This work embraces off-policyness and proposes Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL), a novel off-policy RL algorithm that requires neither importance sampling nor modifications to the inference engine. OAPL outperforms GRPO with importance sampling on competition math benchmarks and can match the performance of a publicly available coding model, DeepCoder, on LiveCodeBench.

Abstract

Reinforcement learning (RL) approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO. However, policy lag from distributed training architectures and differences between the training and inference policies break the on-policy assumption, making the data off-policy by design. To rectify this, prior work has focused on making this off-policy data appear more on-policy, either via importance sampling (IS) or by more closely aligning the training and inference policies through explicit modifications to the inference engine. In this work, we embrace off-policyness and propose a novel off-policy RL algorithm that does not require these modifications: Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL). We show that OAPL outperforms GRPO with importance sampling on competition math benchmarks, and can match the performance of a publicly available coding model, DeepCoder, on LiveCodeBench, while using 3x fewer generations during training. We further empirically demonstrate that models trained via OAPL have improved test-time scaling under the Pass@k metric. OAPL allows for efficient, effective post-training even with lags of more than 400 gradient steps between the training and inference policies, 100x more off-policy than prior approaches.

Paper Structure (26 sections, 6 equations, 5 figures, 6 tables, 1 algorithm)

Introduction
Background
Method: OAPL
Off-policy Loss Function
OAPL: The Off-policy RL Algorithm
Comparison to GRPO
Comparison to $A^\star$PO
Related Work
Off-policy RL Post-Training
Off-Policy RL in Asynchronous Settings
Experimental Setup
Math Experimental Setup
Code Generation Experimental Setup
Experimental Results
Results on Competition Math
...and 11 more sections

Figures (5)

Figure 1: OAPL and GRPO on math reasoning benchmarks. Bars show the average of the maximum accuracy across three runs, with error bars indicating standard error. We report Pass@1 (computed via averaging over 10 rollouts per prompt), Pass@5, and Pass@10 on (Left) HMMT-25 (Feb & Nov), (Middle) AIME-25, and (Right) BRUMO-25.
Figure 2: Training curves on competition math. Curves show mean accuracy across three benchmarks (AIME25, HMMT25, BRUMO25) and shaded regions denote standard error. (Left) Pass@1, (Middle) Pass@5, and (Right) Pass@10. OAPL converges to higher accuracy and remains more stable than GRPO over training.
Figure 3: Training dynamics and robustness to policy lag in competition math. (Left) The training entropy for both OAPL and GRPO (mean across three runs; shaded region is standard error). (Right) Accuracy over training for OAPL with a larger synchronization interval ($K=100$), averaged over AIME, HMMT, and BRUMO; dashed/dotted lines show Pass@1/5/10 computed from 10 rollouts per prompt. OAPL remains stable even under substantially lagged rollouts.
Figure 4: Scaling behaviors of OAPL and GRPO for Pass@k. We observe RL training increases Pass@k for all $k$ ranging from $1$ to $256$. OAPL improves scaling relative to GRPO and the base model. (Left) Average across all benchmarks; remaining panels show per-benchmark results (AIME25, HMMT25 Nov, HMMT25 Feb, BRUMO25).
Figure 5: Code generation results on LiveCodeBench. (Left) Pass@k scaling for OAPL, DeepCoder, and the shared base model. (Right) Sample efficiency: Pass@1 accuracy versus the number of training generations, highlighting that OAPL matches DeepCoder while using substantially fewer samples. All metrics are computed from 20 rollouts per prompt using the same evaluation protocol as DeepCoder.
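
The captions above report Pass@k computed from a fixed number of rollouts per prompt. As a rough illustration of how such a number can be obtained, the sketch below uses the standard unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of rollouts and c the number of correct ones. Whether the paper uses this exact estimator is an assumption; the function names pass_at_k and mean_pass_at_k are hypothetical and chosen only for this example. For k = 1 the estimator reduces to c / n, which agrees with the caption's description of Pass@1 as an average over rollouts.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one prompt.

    n: number of rollouts sampled for the prompt
    c: number of those rollouts that are correct
    k: sample budget being evaluated (1 <= k <= n)

    Assumption: this is the commonly used combinatorial estimator,
    not necessarily the exact procedure used in the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k rollouts are incorrect, every size-k subset
    # must contain at least one correct rollout.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset has no correct rollout,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-prompt estimate over a benchmark."""
    if not correct_counts:
        raise ValueError("correct_counts must be non-empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Hypothetical data: 3 prompts, 10 rollouts each, with 0, 1, and 10 correct.
    counts = [0, 1, 10]
    for k in (1, 5, 10):
        print(k, mean_pass_at_k(counts, n=10, k=k))
```

Averaging the per-prompt estimate, rather than pooling rollouts across prompts, keeps each prompt equally weighted regardless of how many of its rollouts happen to be correct.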
