On-Policy RL with Optimal Reward Baseline

Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei

TL;DR

The paper addresses training instability and computational inefficiency in reinforcement learning for large language model alignment and reasoning. It introduces On-Policy RL with Optimal Reward Baseline (OPO), which combines exact on-policy training with a variance-minimizing baseline $b^*$ that is simplified to a length-weighted form for sequence generation. The method uses a single policy objective without KL or entropy regularization and needs no auxiliary models, while delivering superior performance on math reasoning benchmarks and more diverse outputs. Empirically, OPO shows stable training dynamics, smaller policy shifts, and enhanced exploration. The implementation is merged into the verl library.

Abstract

Reinforcement learning algorithms are fundamental to aligning large language models with human preferences and to enhancing their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal Reward Baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO integrates a practically feasible formulation of the optimal reward baseline that minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is merged into the verl library at https://verl.readthedocs.io/en/latest/algo/opo.html.
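
The summary above describes the optimal baseline as simplifying to a length-weighted form for sequence generation, but does not spell out the formula. The sketch below illustrates one reading of that statement, assuming the baseline is the response-length-weighted mean of the rewards over the responses sampled for one prompt, $b^* = \sum_i l_i r_i / \sum_i l_i$, with advantage $r_i - b^*$. The function names `length_weighted_baseline` and `opo_advantages` are hypothetical and are not the verl API.

```python
from typing import List, Sequence


def length_weighted_baseline(rewards: Sequence[float], lengths: Sequence[int]) -> float:
    """Length-weighted mean reward: sum(l_i * r_i) / sum(l_i).

    Assumed form of the simplified baseline; rewards and lengths
    belong to responses sampled for the same prompt.
    """
    if len(rewards) != len(lengths) or len(rewards) == 0:
        raise ValueError("rewards and lengths must be non-empty and of equal size")
    total_length = sum(lengths)
    if total_length <= 0:
        raise ValueError("total response length must be positive")
    return sum(l * r for l, r in zip(lengths, rewards)) / total_length


def opo_advantages(rewards: Sequence[float], lengths: Sequence[int]) -> List[float]:
    """Per-response advantage: reward minus the length-weighted baseline."""
    baseline = length_weighted_baseline(rewards, lengths)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt: binary rewards and token counts.
    rewards = [1.0, 0.0, 1.0, 0.0]
    lengths = [100, 300, 200, 400]
    print(length_weighted_baseline(rewards, lengths))
    print(opo_advantages(rewards, lengths))
```

In this example the plain mean reward is 0.5, while the length-weighted baseline is lower because the longer responses are the ones with zero reward. Under the assumption above, longer responses pull the baseline toward their own reward.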

Paper Structure

This paper contains 28 sections, 19 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Training dynamics of on-policy and off-policy training. Left: Training rewards; Middle: KL divergence; Right: Entropy.
  • Figure 2: Left: Comparison of KL divergence and math performance between OPO and GRPO. Both OPO and GRPO follow the exact on-policy training from the SFT policy. The x-axis represents KL divergence, and the y-axis denotes math performance. Middle: Training dynamics of KL divergence. Right: Training dynamics of entropy.
  • Figure 3: Training dynamics of OPO and Reinforce++. Both OPO and Reinforce++ follow the exact on-policy training. Left: Training rewards; Middle: KL divergence; Right: Entropy.