DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training

Dingwei Zhu, Zhiheng Xi, Shihan Dou, Jiahan Li, Chenhao Huang, Junjie Ye, Sixian Li, Mingxu Chai, Yuhui Wang, Yajie Yang, Ming Zhang, Jiazheng Zhang, Shichun Liu, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang

TL;DR

DFPO addresses the instability and weak OOD generalization of real-world RL by replacing discrete, independently learned quantiles with a continuous value-flow field defined over a virtual horizon $t\in[0,1]$. A transformer-based backbone feeds a neural ODE flow head $v_\theta$ that evolves return distributions, while risk-sensitive and geometric consistency constraints stabilize learning and preserve high-value exploration. The framework combines Uncertainty-Weighted Distributional CFM, Bootstrapped Anchor Regularization, and Geometric Consistency, which orient flows along optimal-transport-like trajectories, with Conditional Value Risk Optimization and tail-shape regularization, which control the lower and upper tails of the return distribution. Empirically, DFPO trains stably under noisy supervision and generalizes across dialogue, math, and science tasks, outperforming PPO, FlowRL, and other baselines. These results suggest that flow-based value modeling with risk control and OT-inspired constraints can scale robust RL to complex, noisy real-world settings such as LLM post-training and policy alignment.
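To make the idea of a value-flow field over $t\in[0,1]$ concrete, here is a minimal sketch of how a velocity field can transport base noise into return samples by Euler integration. It is an illustration under stated assumptions, not the paper's implementation: `toy_velocity` is a hypothetical closed-form stand-in for the learned flow head $v_\theta$, and the per-state `mu` and `sigma` values are made-up numbers standing in for whatever the backbone would encode.

```python
import numpy as np

def toy_velocity(z, t, mu, sigma):
    """Hypothetical stand-in for a learned flow head v_theta(z, t | state).

    Closed-form field that transports N(0, 1) to N(mu, sigma^2) along
    straight lines; a trained network would replace this function.
    """
    scale = 1.0 - t + t * sigma      # spread of z_t along the path
    z0 = (z - t * mu) / scale        # recover the sample's starting point
    return (sigma - 1.0) * z0 + mu   # constant velocity of that straight path

def integrate_flow(z0, mu, sigma, n_steps=8):
    """Euler-integrate dz/dt = v(z, t) over the virtual horizon t in [0, 1]."""
    z = z0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        z = z + dt * toy_velocity(z, k * dt, mu, sigma)
    return z

rng = np.random.default_rng(0)
mu = np.array([[0.5], [-1.0]])        # per-state return mean (assumed values)
sigma = np.array([[0.2], [1.5]])      # per-state return spread (assumed values)
z0 = rng.standard_normal((2, 1000))   # base samples at t = 0, one row per state
returns = integrate_flow(z0, mu, sigma)   # return samples at t = 1
value_estimate = returns.mean(axis=1)     # scalar value read off each distribution
```

Because every trajectory of this toy field is a straight line, Euler integration reproduces `mu + sigma * z0` up to floating-point error. A learned field would generally need more steps or a higher-order solver.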

Abstract

Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. The resulting value representations are coarse-grained, lack fine-grained conditioning on state information, and struggle under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By scaling value modeling through learning a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.
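Learning a flow field rather than isolated quantiles is typically done by regressing a velocity model onto path velocities. The sketch below shows the plain, unweighted conditional flow matching objective on a straight-line path. It assumes that "CFM" in this summary refers to conditional flow matching, and it does not attempt DFPO's uncertainty weighting or its other regularizers. All function names and the target distribution are hypothetical.

```python
import numpy as np

def cfm_loss(velocity_fn, z0, z1, t):
    """Plain conditional flow matching loss on a straight-line path."""
    zt = (1.0 - t) * z0 + t * z1   # point on the path at time t
    target = z1 - z0               # velocity of the straight-line path
    pred = velocity_fn(zt, t)
    return np.mean((pred - target) ** 2)

def zero_field(z, t):
    """Baseline that predicts no motion at all."""
    return np.zeros_like(z)

def const_field(z, t):
    """Baseline that predicts only the mean shift between the endpoints."""
    return np.full_like(z, 2.0)

def oracle_field(z, t):
    """Closed-form E[z1 - z0 | z_t] for the Gaussian toy problem below."""
    cov = 0.25 * t - (1.0 - t)            # Cov(z1 - z0, z_t)
    var = (1.0 - t) ** 2 + 0.25 * t ** 2  # Var(z_t)
    return 2.0 + cov / var * (z - 2.0 * t)

rng = np.random.default_rng(0)
n = 4096
z0 = rng.standard_normal(n)              # base noise at t = 0
z1 = 2.0 + 0.5 * rng.standard_normal(n)  # made-up target return samples at t = 1
t = rng.uniform(size=n)                  # random times on the virtual horizon

loss_zero = cfm_loss(zero_field, z0, z1, t)
loss_const = cfm_loss(const_field, z0, z1, t)
loss_oracle = cfm_loss(oracle_field, z0, z1, t)
```

With independently sampled endpoints the loss does not reach zero even for the best possible field, so in this toy setting the useful signal is the ordering `loss_oracle < loss_const < loss_zero`, not the absolute value.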

Paper Structure (60 sections, 22 equations, 13 figures, 12 tables, 1 algorithm)

Figures (13)

  • Figure 1: Comparison between the standard scalar value model and DFPO with distributional value flow modeling. The standard value model is sensitive to noisy and biased reward signals, which often leads to unstable value estimation and unreliable advantage learning. In contrast, DFPO models a token-level distributional value flow field across time steps, thereby capturing state information more effectively under noisy supervision. By integrating conditional risk control and consistency constraints along value flow trajectories, DFPO suppresses spurious value fluctuations while preserving high-value exploration. CB denotes the capacity boundary of the model.
  • Figure 2: Token-level advantage estimation for the same response across different methods. Our method demonstrates better alignment between high advantage scores and key words.
  • Figure 3: Comparison of the output value distributions of one token in the answer part. (Left) PPO with standard distributional value flow modeling shows sharp, unstable lower-tail expansion, indicating excessive risk accumulation and unreliable training variance. (Right) DFPO constrains lower-tail risk (slight contraction) while promoting exploratory upper-tail expansion, balancing noise robustness and generalization in complex scenarios.
  • Figure 4: Comparison of value flow trajectories of one token in the answer part. (Left) PPO with standard distributional value flow modeling, without consistency constraints, generates intertwined, chaotic trajectories, causing unstable updates and poor generalization. (Right) DFPO produces smooth, coherent paths that support accurate advantage estimation even under OOD conditions.
  • Figure 5: Noise statistics across tasks. A significant portion of rewards is inaccurate.
  • ...and 8 more figures
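The lower-tail risk mentioned in the Figure 3 caption can be quantified from return samples in several ways. One common choice is the conditional value at risk (CVaR), the mean of the worst $\alpha$-fraction of outcomes. The sketch below computes it with NumPy. Whether DFPO's Conditional Value Risk Optimization uses exactly this estimator is an assumption, and the function name and sample values are hypothetical.

```python
import numpy as np

def lower_cvar(samples, alpha=0.1):
    """Mean of the worst alpha-fraction of return samples (lower tail)."""
    ordered = np.sort(np.asarray(samples, dtype=float))
    k = max(1, int(np.ceil(alpha * ordered.size)))  # number of tail samples
    return ordered[:k].mean()

# Made-up return samples for a single token.
returns = np.array([-3.0, -1.0, 0.0, 0.5, 1.0, 2.0, 3.0, 4.0])
risk = lower_cvar(returns, alpha=0.25)  # averages the two smallest samples
mean_value = returns.mean()
```

A risk-sensitive objective could penalize the gap between `mean_value` and `risk`, which grows when the lower tail expands as in the left panel of Figure 3.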
