Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning

Xu Wan; Yansheng Wang; Wenqi Huang; Mingyang Sun

TL;DR

Batch Adaptation Policy Optimization (BAPO) is an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee on policy improvement.

Abstract

Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee on policy improvement. Extensive experiments demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO resolves 40.7% of the problems that base models consistently fail to solve.
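The batch selection idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Sample` record, the `build_batch` function, the `reevaluate` callback, the rule that a group is "informative" when its accuracy lies strictly between 0 and 1, and the choice to prefer buffered samples whose accuracy is closest to 0.5. The paper's actual filtering function (Definition 3.1) may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    prompt: str
    accuracy: float  # fraction of rollouts in the group judged correct, in [0, 1]


def is_informative(accuracy: float) -> bool:
    # Assumption: a group with all-wrong or all-right rollouts has homogeneous
    # rewards and therefore carries no group-relative learning signal.
    return 0.0 < accuracy < 1.0


def build_batch(
    fresh: List[Sample],
    buffer: List[Sample],
    batch_size: int,
    reevaluate: Callable[[str], float],
) -> List[Sample]:
    """Assemble one training batch from fresh rollouts and a replay buffer."""
    # Step 1: keep fresh samples whose rewards are mixed.
    batch = [s for s in fresh if is_informative(s.accuracy)]

    # Step 2: re-evaluate historically difficult prompts (accuracy == 0) with
    # the current policy and keep those that now yield mixed rewards.
    for s in buffer:
        if len(batch) >= batch_size:
            break
        if s.accuracy == 0.0:
            new_accuracy = reevaluate(s.prompt)
            if is_informative(new_accuracy):
                batch.append(Sample(s.prompt, new_accuracy))

    # Step 3: fill any remaining slots by reusing buffered samples that were
    # already informative, preferring accuracy closest to 0.5 (hypothetical
    # quality heuristic).
    reusable = sorted(
        (s for s in buffer if is_informative(s.accuracy)),
        key=lambda s: abs(s.accuracy - 0.5),
    )
    for s in reusable:
        if len(batch) >= batch_size:
            break
        batch.append(s)

    return batch[:batch_size]
```

In this sketch, `reevaluate` stands in for running the current policy on a stored prompt and scoring the new rollouts with the verifier; a real system would also need to store the rollouts themselves and account for the off-policy gap when reusing them.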

Paper Structure (32 sections, 4 theorems, 34 equations, 16 figures, 4 tables, 1 algorithm)

Introduction
Related Work
On-policy RL Post-training Framework
Off-policy RL Post-training Framework
Method
Formal Definitions
Adaptive Training Batch Construction
Theoretical Analysis
Experimental Setup
Results Analysis
Main Results
Mechanism Analysis
Detailed Analysis
Conclusion
Appendix
...and 17 more sections

Key Result

Theorem 3.2

Assume rewards are bounded: $0 \leq r \leq 1$. Let $\pi_{\theta_t}$ be the current policy, $\alpha_1 = \pi_{\theta_{t-v}}$ be the delayed rollout policy, $\alpha_2 = \pi_{\theta_t}$ be the current policy for re-evaluation, $\alpha_3 = \alpha_{\mathcal{B}}$ be the buffer policy distribution, and $I(x$ [...] where $\delta_1, \delta_3 > 0$ are sufficiently small such that the variance lower bounds remain po...

Figures (16)

Figure 1: Tracking the sample counts across accuracy groups of the mathematical dataset before and after GRPO post-training.
Figure 2: Overview of the (a) on-policy and (b) off-policy RL post-training frameworks
Figure 3: The workflow of (a) off-policy rollout and (b) off-policy training in our RLVR framework
Figure 4: Training Curves of Reward Changes for mathematics, planning, and geometry tasks using DeepSeek Distilled Qwen 1.5B, Qwen2.5 Math 1.5B, and Qwen2.5 VL 3B, respectively.
Figure 5: Test Curves of Group Accuracy Changes on AIME for different RLVR methods based on Qwen3 8B. Left: Standard BAPO vs. GRPO. Middle: BAPO (mini test) vs. GRPO. Right: Standard BAPO vs. DAPO.
...and 11 more figures

Theorems & Definitions (7)

Definition 3.1: Training Batch Filtering Function
Theorem 3.2: Policy Improvement Lower Bound with Adaptive Training Batch
Lemma A.1: Kantorovich-Rubinstein duality of total variation distance
Theorem A.2: Policy Improvement Lower Bound with Adaptive Training Batch
proof
Proposition A.3
proof
