Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance

Chenghua Huang, Lu Wang, Fangkai Yang, Pu Zhao, Zhixu Li, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

TL;DR

The paper tackles the inefficiency and instability of PPO-based RLHF by decoupling value guidance from policy optimization through a pretrained Global Value Model (GVM). It proves that, without new ground-truth rewards, a pretrained reward model and a pretrained GVM provide essentially interchangeable supervision for offline policy updates. Empirically, DVPO achieves competitive performance on multiple benchmarks while reducing GPU usage and training time, owing to token-level return-to-go signals and a fixed value guide. The approach offers a scalable path for aligning large language models with human preferences in offline RLHF settings, with both stability and efficiency advantages for large-scale fine-tuning.

Abstract

Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires joint training of an actor and a critic, guided by a pretrained, fixed reward model. This approach increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM). The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show that DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
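
To make "token-level return-to-go" concrete, the following is a minimal sketch of how per-token regression targets could be derived from a single sequence-level reward. The function name `return_to_go`, the discount factor `gamma`, and the placement of the reward on the final token are illustrative assumptions, not details taken from the paper.

```python
from typing import List


def return_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go for every token position.

    G_t = r_t + gamma * G_{t+1}, computed right to left.
    `gamma` is an assumed hyperparameter, not a value from the paper.
    """
    targets = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


# Hypothetical 4-token response: the only reward is a sequence-level
# score attached to the last token (an assumption for illustration).
sparse_rewards = [0.0, 0.0, 0.0, 1.0]
rtg = return_to_go(sparse_rewards, gamma=0.9)
# rtg is approximately [0.729, 0.81, 0.9, 1.0]
```

Under these assumptions, a value model regressed onto `rtg` receives a distinct target at each token position even though the underlying feedback was given once per response.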

Paper Structure

This paper contains 24 sections, 1 theorem, 13 equations, 4 figures, 7 tables.

Key Result

Theorem 3.1

Suppose: Then any policy gradient method that employs either $R_\phi$ or $Q_\psi$ as its supervisory signal will yield policy updates differing by at most a constant factor dependent on $(\epsilon_R,\epsilon_Q)$. As $\epsilon_R,\epsilon_Q \to 0$, the two approaches become equivalent in guiding policy optimiz

Figures (4)

  • Figure 1: Overview of Decoupled Value Policy Optimization (DVPO) and PPO in RLHF. DVPO eliminates the need for a reward model and decouples policy and value learning during policy optimization. In contrast, PPO requires training a reward model before policy optimization. DVPO instead trains a global value model using the same offline data as the reward model. During policy training, no additional ground-truth rewards are obtained.
  • Figure 2: Results of the model on the Ultrafeedback held-out test set. We employed GPT-4o as a judge to assess the quality of model-generated responses. Performance is measured using the win rate, where Left represents DVPO, and Right represents the baseline model for comparison.
  • Figure 3: Learning curve of the policy model during the RL stage under the Base setting. DVPO demonstrates faster and more stable convergence compared to other methods.
  • Figure 4: An example of the supervisory signal provided by a Global Value Model (GVM). The GVM is capable of providing token-level feedback. In this example, the GVM assigns a lower value to the incorrect response (response2: "is an island") and a higher value to the critical token "not" in the correct response (response1: "not an island").
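
The captions above describe a frozen GVM that supplies token-level feedback while only the policy is updated. One way this could look in code is a PPO-style clipped surrogate whose advantages come from fixed value scores. This is a sketch under stated assumptions: the mean-centering baseline, the clipping constant, and all names and numbers below are hypothetical and are not claimed to be the paper's exact objective.

```python
import math
from typing import List


def centered(values: List[float]) -> List[float]:
    """Subtract the mean so the scores act as advantages (assumed baseline)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def clipped_surrogate_loss(
    new_logp: List[float],
    old_logp: List[float],
    advantages: List[float],
    clip_eps: float = 0.2,
) -> float:
    """PPO-style clipped loss averaged over tokens.

    The advantages are treated as constants: no critic is trained here,
    which mirrors the idea of a frozen value guide.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


# Hypothetical per-token scores from a frozen value model.
gvm_scores = [0.2, 0.9, 0.4]
advantages = centered(gvm_scores)

# Before any update the new and old policies coincide, so every ratio is 1.
old_logp = [-1.0, -0.5, -2.0]
loss_at_start = clipped_surrogate_loss(old_logp, old_logp, advantages)
```

Because the value scores are fixed inputs, the only trainable quantity in this loss is the policy's log-probabilities, which is the decoupling the figure captions refer to.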

Theorems & Definitions (2)

  • Theorem 3.1: Equivalence of Pretrained Reward and GVM
  • proof