
Value Flows

Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, Benjamin Eysenbach

TL;DR

Value Flows addresses the limitation of scalar-return RL by modeling the full return distribution with flow-based density estimation. It introduces a distributional flow-matching objective that enforces the distributional Bellman equation, and a flow-derivative ODE that estimates return variance; the variance estimate is then used to reweight learning toward high-uncertainty transitions. The method supports both offline and offline-to-online RL, achieving a 1.3x average improvement in success rates across 37 state-based and 25 image-based benchmark tasks, along with higher average success rates in online fine-tuning. Modeling the full distribution provides more informative learning signals and identifies states with high return uncertainty for decision-making. Code is publicly available.

Abstract

While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. While the predominant method for estimating the return distribution is by modeling it as a categorical distribution over discrete bins or estimating a finite number of quantiles, such approaches leave unanswered questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning a more accurate return estimation on certain transitions. We compare our method (Value Flows) with prior methods in the offline and offline-to-online settings. Experiments on $37$ state-based and $25$ image-based benchmark tasks demonstrate that Value Flows achieves a $1.3\times$ improvement on average in success rates. Website: https://pd-perry.github.io/value-flows Code: https://github.com/chongyi-zheng/value-flows
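
To make the contrast between a scalar value and a return distribution concrete, the following is a minimal NumPy sketch under assumed toy settings. It applies a sample-based distributional Bellman backup, $Z(s, a) \overset{D}{=} r + \gamma Z(s', a')$, to two made-up next-state return distributions that share the same mean. The function name `bellman_backup_samples`, the reward, the discount, and the sample distributions are all hypothetical and are not taken from the paper; the sketch only illustrates why the mean alone hides differences in return uncertainty.

```python
import numpy as np

def bellman_backup_samples(reward, gamma, next_return_samples):
    """Sample-based distributional Bellman backup: z = r + gamma * z'."""
    return reward + gamma * np.asarray(next_return_samples, dtype=float)

rng = np.random.default_rng(0)
gamma, reward = 0.99, 1.0  # hypothetical discount and reward

# Two hypothetical next-state return distributions with (nearly) equal means.
# "narrow" is concentrated around 10; "wide" is bimodal around 5 and 15.
narrow = rng.normal(loc=10.0, scale=0.1, size=50_000)
wide = np.concatenate([
    rng.normal(loc=5.0, scale=0.1, size=25_000),
    rng.normal(loc=15.0, scale=0.1, size=25_000),
])

z_narrow = bellman_backup_samples(reward, gamma, narrow)
z_wide = bellman_backup_samples(reward, gamma, wide)

# A scalar critic would see almost the same value for both...
print("means:    ", z_narrow.mean(), z_wide.mean())
# ...while the variance separates the low- and high-uncertainty cases.
print("variances:", z_narrow.var(), z_wide.var())
```

In this toy setup the two backed-up means nearly coincide, while the variances differ by orders of magnitude; this is the kind of per-state uncertainty signal that a full return distribution exposes.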

Paper Structure

This paper contains 49 sections, 12 theorems, 51 equations, 10 figures, 5 tables, 2 algorithms.

Key Result

Proposition 1

Given the vector field $v_k$ that generates the probability density path $p_{k}$, the new vector field $v_{k + 1}$ generates the new probability density path $p_{k + 1}$.
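
As a toy illustration of what it means for a vector field to generate a probability density path, the sketch below integrates a one-dimensional ODE with Euler steps. It assumes a standard normal source, a Gaussian target $\mathcal{N}(\mu, \sigma^2)$ standing in for a return distribution, and the linear interpolation path $x_t = (1 - t) x_0 + t x_1$, for which the marginal vector field has a closed form. These choices, and the names `gaussian_vector_field` and `euler_sample`, are assumptions made for illustration; the sketch uses no learned network and does not implement the paper's update from $v_k$ to $v_{k+1}$.

```python
import numpy as np

def gaussian_vector_field(x, t, mu, sigma):
    """Closed-form marginal vector field for x_t = (1 - t) * x0 + t * x1,
    with x0 ~ N(0, 1) and x1 ~ N(mu, sigma^2) drawn independently."""
    var_t = (1.0 - t) ** 2 + (t * sigma) ** 2      # Var[x_t]
    slope = (t * sigma ** 2 - (1.0 - t)) / var_t   # Cov[x1 - x0, x_t] / Var[x_t]
    return mu + slope * (x - t * mu)               # E[x1 - x0 | x_t = x]

def euler_sample(mu, sigma, n_samples=20_000, n_steps=200, seed=0):
    """Push noise samples through the ODE dx/dt = v(x, t) from t = 0 to t = 1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)  # samples from the source density at t = 0
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * gaussian_vector_field(x, t, mu, sigma)
    return x

samples = euler_sample(mu=3.0, sigma=0.5)
print(samples.mean(), samples.std())
```

Under these assumptions the samples at $t = 1$ should have mean and standard deviation close to the target's $\mu$ and $\sigma$, up to Euler discretization and sampling error.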

Figures (10)

  • Figure 1: Value Flows models the return distribution at each time step using a flow-matching model that is optimized to obey the Bellman Equation at each transition.
  • Figure 2: Visualizing the return distribution. (Column 1) The policy completes the task of closing the window and closing the drawer using the buttons to lock and unlock them. (Column 2) C51 predicts a noisy multi-modal distribution and (Column 3) CODAC collapses to a single return mode. (Column 4) Value Flows infers a smooth return histogram resembling the ground-truth return distribution. (Column 5) Quantitatively, Value Flows achieves $3 \times$ lower $1$-Wasserstein distance than alternative methods. See Sec. \ref{subsec:vis-return-distribution} for details.
  • Figure 3: Offline-to-online evaluation. Using the same distributional flow-matching objective, Value Flows achieves higher average success rates. See Fig. \ref{fig:offline-to-online} for the full results.
  • Figure 4: Regularizing the flow-matching loss is important. The regularization coefficient $\lambda$ needs to be tuned for better performance.
  • Figure 5: Reweighting the flow-matching objective boosts success rates. Choosing the confidence weight appropriately improves the performance of Value Flows.
  • ...and 5 more figures

Theorems & Definitions (24)

  • Proposition 1: Informal
  • Proposition 2: Informal
  • Proposition 3: Informal
  • Proposition 4: Informal
  • Lemma 1: Proposition 1 of morimura2010nonparametric
  • proof
  • Lemma 2: Chapter 4 of bellemare2023distributional
  • proof
  • Lemma 3
  • proof
  • ...and 14 more