Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning

Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun

TL;DR

Batch Adaptation Policy Optimization (BAPO) is an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while retaining a lower-bound guarantee for policy improvement.

Abstract

Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while retaining a lower-bound guarantee for policy improvement. Extensive experiments further demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of problems that base models consistently fail to solve.
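
To make "reward homogeneity" concrete, the following minimal sketch computes GRPO-style group-relative advantages for three hypothetical rollout groups. It is an illustration under stated assumptions, not the paper's implementation: the function names, the group labels, the `eps` constant, and the use of the population standard deviation are all choices made here. When every response in a group receives the same verifiable reward, all advantages are zero and the group contributes no learning signal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward within its rollout group (GRPO-style).

    Assumption: population standard deviation plus a small eps for
    numerical stability; real implementations may differ in both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """A group carries signal only if its rewards are not all identical."""
    return len(set(rewards)) > 1


# Hypothetical groups: binary verifiable rewards for 4 responses per prompt.
groups = {
    "always_wrong": [0.0, 0.0, 0.0, 0.0],
    "mixed": [1.0, 0.0, 0.0, 1.0],
    "always_right": [1.0, 1.0, 1.0, 1.0],
}

advantages = {name: group_relative_advantages(r) for name, r in groups.items()}
informative = [name for name, r in groups.items() if has_learning_signal(r)]

for name, adv in advantages.items():
    print(name, [round(a, 3) for a in adv])
print("groups with non-zero advantages:", informative)
```

In this toy setting only the mixed group yields non-zero advantages. Prompts that the policy always fails look the same to the optimizer as prompts it always solves, which is the inefficiency on difficult samples that the abstract describes.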

Paper Structure (32 sections, 4 theorems, 34 equations, 16 figures, 4 tables, 1 algorithm)

Key Result

Theorem 3.2

Assume rewards are bounded: $0 \leq r \leq 1$. Let $\pi_{\theta_t}$ be the current policy, $\alpha_1 = \pi_{\theta_{t-v}}$ be the delayed rollout policy, $\alpha_2 = \pi_{\theta_t}$ be the current policy for re-evaluation, $\alpha_3 = \alpha_{\mathcal{B}}$ be the buffer policy distribution, and $I(x$ [...] where $\delta_1, \delta_3 > 0$ are sufficiently small such that the variance lower bounds remain po [...]

Figures (16)

  • Figure 1: Tracking the sample counts across accuracy groups of the mathematical dataset before and after GRPO post-training.
  • Figure 2: Overview of the (a) on-policy and (b) off-policy RL post-training frameworks.
  • Figure 3: The workflow of (a) off-policy rollout and (b) off-policy training in our RLVR framework.
  • Figure 4: Training curves of reward changes for mathematics, planning, and geometry tasks using DeepSeek Distilled Qwen 1.5B, Qwen2.5 Math 1.5B, and Qwen2.5 VL 3B, respectively.
  • Figure 5: Test curves of group accuracy changes on AIME for different RLVR methods based on Qwen3 8B. Left: Standard BAPO vs. GRPO. Middle: BAPO (mini test) vs. GRPO. Right: Standard BAPO vs. DAPO.
  • ...and 11 more figures

Theorems & Definitions (7)

  • Definition 3.1: Training Batch Filtering Function
  • Theorem 3.2: Policy Improvement Lower Bound with Adaptive Training Batch
  • Lemma A.1: Kantorovich-Rubinstein duality of total variation distance
  • Theorem A.2: Policy Improvement Lower Bound with Adaptive Training Batch
  • proof
  • Proposition A.3
  • proof
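
For orientation, the duality named in Lemma A.1 is usually stated in the following form. This is the standard textbook statement for two distributions $p$ and $q$ on a common space, not the paper's exact wording, and the symbols $p$, $q$, and $f$ are introduced here for illustration only:

$$
D_{\mathrm{TV}}(p, q) = \frac{1}{2} \sup_{\|f\|_\infty \leq 1} \left| \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \right|
$$

Under the bounded-reward assumption $0 \leq r \leq 1$ of Theorem 3.2, taking $f = 2r - 1$ gives $\left| \mathbb{E}_{p}[r] - \mathbb{E}_{q}[r] \right| \leq D_{\mathrm{TV}}(p, q)$. A bound of this kind is the usual way to control the gap in expected reward between a behavior distribution and the current policy.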