DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

Dingwei Zhu, Zhiheng Xi, Shihan Dou, Yuhui Wang, Sixian Li, Junjie Ye, Honglin Guo, Shichun Liu, Chenhao Huang, Yajie Yang, Junlin Shang, Senjie Jin, Ming Zhang, Jiazheng Zhang, Caishuang Huang, Yunke Zhang, Demei Yan, Yuran Wang, Tao Gui

TL;DR

DVPO tackles the instability of RL under noisy supervision in LLM post-training by modeling token-level value distributions and enforcing an asymmetric tail constraint via conditional risk. Using a multi-head quantile ensemble and distributional GAE, DVPO provides richer supervision than scalar value methods, while a suite of tail- and curvature-regularization terms balances robustness with exploration. Empirical results across dialogue, math, and science tasks show that DVPO outperforms PPO, GRPO, and robust Bellman PPO under noisy signals, with strong in-domain and cross-domain generalization. The approach offers a scalable, robust framework for real-world RL where supervision quality is imperfect and tasks are diverse.

Abstract

Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabilize training and harm generalization. While existing approaches such as worst-case optimization (e.g., RFQI, CQL) and mean-based methods (e.g., PPO, GRPO) can improve stability, they often overlook generalization and may produce overly conservative policies, leading to uneven performance across diverse real-world scenarios. To this end, we introduce DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), a new RL framework that combines conditional risk theory with distributional value modeling to better balance robustness and generalization. DVPO learns token-level value distributions to provide fine-grained supervision, and applies an asymmetric risk regularization to shape the distribution tails: it contracts the lower tail to dampen noisy negative deviations, while expanding the upper tail to preserve exploratory diversity. In extensive experiments and analysis on multi-turn dialogue, math reasoning, and scientific QA, DVPO consistently outperforms PPO, GRPO, and robust Bellman-based PPO under noisy supervision, showing its potential for LLM post-training in the real world.
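The asymmetric tail shaping described in the abstract can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's objective: `tail_means`, `asymmetric_tail_penalty`, and the parameters `alpha`, `w_lower`, and `w_upper` are all hypothetical. The idea shown is only that a penalty can charge for a wide lower tail while crediting a wide upper tail, so that minimizing it contracts one side and expands the other.

```python
import statistics


def tail_means(quantiles, alpha=0.25):
    """Mean of the lowest and highest alpha-fraction of quantile estimates."""
    qs = sorted(quantiles)
    k = max(1, int(len(qs) * alpha))
    lower = sum(qs[:k]) / k   # CVaR-style lower-tail mean
    upper = sum(qs[-k:]) / k  # mirrored upper-tail mean
    return lower, upper


def asymmetric_tail_penalty(quantiles, alpha=0.25, w_lower=1.0, w_upper=0.5):
    """Hypothetical regularizer: penalize lower-tail spread, reward upper-tail spread."""
    median = statistics.median(quantiles)
    lower, upper = tail_means(quantiles, alpha)
    lower_gap = median - lower  # distance from median down to the lower tail
    upper_gap = upper - median  # distance from median up to the upper tail
    # Minimizing this shrinks lower_gap and grows upper_gap.
    return w_lower * lower_gap - w_upper * upper_gap


symmetric = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
heavy_lower = [-4.0, -3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(asymmetric_tail_penalty(symmetric))    # 1.5
print(asymmetric_tail_penalty(heavy_lower))  # 5.5
```

In this toy setup a distribution with a heavier lower tail receives a larger penalty than a symmetric one with the same upper tail, which matches the qualitative behavior the abstract describes.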

Paper Structure

This paper contains 52 sections, 13 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison between the Standard Value Model and our Distributional Value Model Based on Conditional Risk Control Optimization (DVPO). The standard value model suffers from reward noise and biased value estimation, leading to unstable policy updates. DVPO introduces a multi-head distributional value model to capture the uncertainty of the value estimate, and balances robustness and generalization under noise through lower-tail risk contraction and upper-tail exploratory expansion.
  • Figure 2: Token-level advantage estimation for the same response across different methods. Our method exhibits sharper focus on key words.
  • Figure 3: Comparison of the output value distributions of the first-token method for the answer part. The robust Bellman PPO method contracts partially in the lower tail, with small variance during learning but insufficient exploration. Our method also contracts slightly in the lower tail, achieves significant exploratory expansion in the upper tail, and maintains a good balance between generalization and robustness.
  • Figure 4: Noise statistics in the various tasks. A significant portion of rewards contains inaccuracies.
  • Figure 5: A multi-turn example from the Honor-Dialogue dataset. The dataset features realistic, task-oriented, multi-domain conversations, in which each model response includes structured states. This example represents the situation of the first conversation.
  • ...and 4 more figures