Off-Policy Value-Based Reinforcement Learning for Large Language Models

Peng-Yuan Wang; Ziniu Li; Tian Xu; Bohan Yang; Tian-Shuo Liu; ChenYang Wang; Xiong-Hui Chen; Yi-Chen Li; Tianyun Yang; Congliang Chen; Yang Yu

Abstract

Improving data utilization efficiency is critical for scaling reinforcement learning (RL) to long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for large language models (LLMs) are largely on-policy: they use each batch of data for a single update, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster than GRPO but also outperforms it in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and improves over GRPO by 2.7% on AIME24 and by 4.5% on the out-of-domain benchmark GPQA. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.
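To make the idea of replay-buffer-based reuse concrete, the following is a minimal sketch of a fixed-capacity buffer that stores past trajectories and samples them for later updates. It is an illustration only, not the paper's implementation: the class name `ReplayBuffer`, the `(prompt, response, reward)` record layout, and the first-in-first-out eviction rule are all assumptions made for this example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical FIFO buffer of (prompt, response, reward) records."""

    def __init__(self, capacity, seed=0):
        # A bounded deque silently drops the oldest record once it is full.
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, prompt, response, reward):
        # Store one generated trajectory together with its outcome reward.
        self._items.append((prompt, response, reward))

    def sample(self, batch_size):
        # Draw records uniformly without replacement; never more than stored.
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)


buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(f"prompt-{i}", f"response-{i}", float(i % 2))
batch = buffer.sample(2)  # old trajectories can be drawn again in later updates
```

Unlike an on-policy loop, which discards a batch after one update, a buffer of this kind lets the same trajectory contribute to several updates until it is evicted.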

Paper Structure (35 sections, 3 theorems, 18 equations, 13 figures, 1 table, 1 algorithm)

Introduction
Preliminaries
LLM and its MDP Formulation
Basic Introduction on LLM.
MDP Formulation of LLM.
Reinforcement Learning with Verifiable Reward
Limitations of On-Policy Methods
Proposed Method
Towards Value-Based RL for LLMs
Q-Function Parameterization in LLMs.
TBRM.
Off-Policy Value-Based Reinforcement Learning with Replay Buffer (ReVal)
Replay Buffer for Off-Policy Learning
Experiments
Experimental Setup
...and 20 more sections

Key Result

Proposition 1

TBRM does not satisfy Calibrated Initialization. Specifically, setting $r(\tau) = 0$ in the TBRM objective does not yield $\pi^* = \pi_\text{ref}$ as the optimal solution.

Figures (13)

Figure 1: Framework of ReVal. By interpreting LLM logits as $Q$-values, ReVal unifies policy and value within a single model and enables replay-based off-policy updates.
Figure 2: Performance of GRPO across different difficulty levels.
Figure 3: Performance under different data reuse frequencies on tasks with varying difficulty levels.
Figure 4: Training curves of DeepSeek-R1-Distill-1.5B and Qwen2.5-Math-7B. Curves show the accuracy across seven benchmarks (AIME, AIME25, AMC, MATH, Minerva, Olympiad, and GPQA) as well as the average accuracy.
Figure 5: Training curves of DeepSeek-R1-Distill-1.5B with $N=1$.
...and 8 more figures
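
The Figure 1 caption describes reading LLM logits as $Q$-values so that one model serves as both policy and value function. A minimal sketch of one standard way to do this, the entropy-regularized (soft) Q-learning relation, is given below. It is an assumption about how the caption can be read, not the paper's exact parameterization: the function names and the temperature `beta` are hypothetical, with $V(s) = \beta \log \sum_a \exp(Q(s,a)/\beta)$ and $\pi(a \mid s) = \exp((Q(s,a) - V(s))/\beta)$.

```python
import math


def soft_value(q_values, beta=1.0):
    """Soft state value: beta * logsumexp(Q / beta), computed stably."""
    scaled = [q / beta for q in q_values]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    return beta * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def policy_from_q(q_values, beta=1.0):
    """Policy implied by the Q-values: pi(a) = exp((Q(a) - V) / beta)."""
    v = soft_value(q_values, beta)
    return [math.exp((q - v) / beta) for q in q_values]


# Toy next-token logits over a three-token vocabulary, read as Q-values.
logits = [2.0, 1.0, 0.0]
probs = policy_from_q(logits)
value = soft_value(logits)
```

With `beta = 1`, the resulting policy coincides with the usual softmax over the logits, which is why no separate value head is needed under this reading.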

Theorems & Definitions (5)

Definition 1: Calibrated Initialization
Proposition 1
Proposition 2
Proposition 3
proof
