Off-Policy Value-Based Reinforcement Learning for Large Language Models

Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu, ChenYang Wang, Xiong-Hui Chen, Yi-Chen Li, Tianyun Yang, Congliang Chen, Yang Yu

Abstract

Improving data utilization efficiency is critical for scaling reinforcement learning (RL) to long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update on each batch of data only once, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and achieves improvements over GRPO of 2.7% on AIME24 and 4.5% on the out-of-domain benchmark GPQA. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.
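
The replay-buffer idea mentioned in the abstract can be illustrated with a minimal sketch. This is not ReVal's implementation: the class name `ReplayBuffer`, the trajectory fields (`prompt`, `response`, `reward`), the FIFO eviction rule, and uniform sampling are all assumptions chosen for illustration. It only shows how stored trajectories can be sampled repeatedly instead of being discarded after one update.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-capacity store of past trajectories (FIFO eviction)."""

    def __init__(self, capacity, seed=0):
        # deque with maxlen silently drops the oldest item once full.
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, trajectory):
        self._items.append(trajectory)

    def sample(self, batch_size):
        # Uniform sampling without replacement; capped at the buffer size.
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)


# Illustrative usage: six rollouts are collected, only the newest four are kept,
# and each training step may draw from them again (off-policy reuse).
buffer = ReplayBuffer(capacity=4)
for step in range(6):
    buffer.add({"prompt": f"q{step}", "response": f"a{step}", "reward": float(step % 2)})

batch = buffer.sample(3)
stored_prompts = sorted(t["prompt"] for t in buffer.sample(len(buffer)))
```

An on-policy loop would correspond to a buffer that is cleared after every update; here the same trajectory can appear in several sampled batches until it is evicted.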

Paper Structure (35 sections, 3 theorems, 18 equations, 13 figures, 1 table, 1 algorithm)

Key Result

Proposition 1

TBRM does not satisfy Calibrated Initialization. Specifically, setting $r(\tau) = 0$ in the TBRM objective does not yield $\pi^* = \pi_\text{ref}$ as the optimal solution.

Figures (13)

  • Figure 1: Framework of ReVal. By interpreting LLM logits as $Q$-values, ReVal unifies policy and value within a single model and enables replay-based off-policy updates.
  • Figure 2: Performance of GRPO across different difficulty levels.
  • Figure 3: Performance under different data reuse frequencies on tasks with varying difficulty levels.
  • Figure 4: Training curves of DPSK-R1-Distill-1.5B and Qwen2.5-Math-7B. Curves show the accuracy across seven benchmarks (AIME, AIME25, AMC, MATH, Minerva, Olympiad, and GPQA) as well as the average accuracy.
  • Figure 5: Training curves of DPSK-R1-Distill-1.5B with $N=1$.
  • ...and 8 more figures

Theorems & Definitions (5)

  • Definition 1: Calibrated Initialization
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • proof