Value-Distributional Model-Based Reinforcement Learning

Carlos E. Luis; Alessandro G. Bottero; Julia Vinogradska; Felix Berkenkamp; Jan Peters

TL;DR

This work develops a principled framework for epistemic uncertainty in long-horizon policy evaluation by modeling the entire value distribution under a posterior over MDPs. It introduces the value-distributional Bellman equation and proves that its fixed point corresponds to the posterior value distribution, enabling concrete learning via Epistemic Quantile-Regression (EQR). By integrating EQR with Soft Actor-Critic (SAC), the paper enables differentiable optimization of any distribution-based objective, including mean and risk-sensitive measures. Empirical results in continuous control tasks show that EQR-SAC improves sample efficiency and final performance relative to model-based and model-free baselines, with ablations highlighting the critical role of the quantile-based critic and model-based targets. The approach provides a flexible, uncertainty-aware framework for policy optimization under model uncertainty, with practical benefits for robust and risk-sensitive control.
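As a rough illustration of what a "distribution-based objective" could look like once a critic outputs quantile estimates, the sketch below computes a mean objective and a crude lower-tail (risk-averse) objective from a list of quantile values. The function names, the midpoint quantile levels, and the tail-averaging rule are assumptions made for this example and are not taken from the paper. Both objectives are plain averages of the quantile outputs, so they would stay differentiable if the same arithmetic were written in an autodiff framework.

```python
def quantile_midpoints(m):
    """Quantile levels at the midpoints of m equal-probability bins (assumed choice)."""
    return [(2 * i + 1) / (2 * m) for i in range(m)]


def mean_objective(quantiles):
    """Risk-neutral objective: the average of the quantile estimates."""
    return sum(quantiles) / len(quantiles)


def lower_tail_objective(quantiles, alpha):
    """Risk-averse objective: average of the estimates whose level is at most alpha.

    This is a crude CVaR-style summary, used here only for illustration.
    If no level is at most alpha, the smallest estimate is returned.
    """
    taus = quantile_midpoints(len(quantiles))
    tail = [q for q, tau in zip(sorted(quantiles), taus) if tau <= alpha]
    if not tail:
        tail = [min(quantiles)]
    return sum(tail) / len(tail)


# Hypothetical quantile estimates of a value distribution at one state.
example_quantiles = [1.0, 2.0, 3.0, 4.0, 10.0]
print("mean objective:", mean_objective(example_quantiles))
print("lower-tail objective:", lower_tail_objective(example_quantiles, alpha=0.5))
```

A risk-averse objective of this kind scores a policy by its pessimistic value estimates, while the mean objective ignores the spread of the distribution.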

Abstract

Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks. We study the problem from a model-based Bayesian reinforcement learning perspective, where the goal is to learn the posterior distribution over value functions induced by parameter (epistemic) uncertainty of the Markov decision process. Previous work restricts the analysis to a few moments of the distribution over values or imposes a particular distribution shape, e.g., Gaussians. Inspired by distributional reinforcement learning, we introduce a Bellman operator whose fixed-point is the value distribution function. Based on our theory, we propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function. We combine EQR with soft actor-critic (SAC) for policy optimization with an arbitrary differentiable objective function of the learned value distribution. Evaluation across several continuous-control tasks shows performance benefits with respect to both model-based and model-free algorithms. The code is available at https://github.com/boschresearch/dist-mbrl.
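To make the idea of a posterior distribution over values concrete, here is a minimal sketch on a hypothetical two-state MDP (not one of the paper's examples). From `s0` the agent stays in `s0` with probability `x` and reward 0, and otherwise moves to a terminal state with reward 1. The Beta(2, 5) belief over `x`, the discount factor, and all names are assumptions chosen for illustration. Each sampled `x` defines one MDP with one scalar value; the spread across samples reflects epistemic uncertainty only.

```python
import random
import statistics

GAMMA = 0.9  # discount factor (illustrative choice)


def value_at_s0(x, gamma=GAMMA):
    """Value of s0 for a fixed self-transition probability x.

    Solves the Bellman equation v = gamma * x * v + (1 - x) in closed form.
    """
    return (1.0 - x) / (1.0 - gamma * x)


def sample_value_distribution(n_samples, seed=0):
    """Draw x from an assumed Beta(2, 5) posterior and return the induced values."""
    rng = random.Random(seed)
    return [value_at_s0(rng.betavariate(2.0, 5.0)) for _ in range(n_samples)]


values = sample_value_distribution(10_000)
quartiles = statistics.quantiles(values, n=4)  # three cut points
print("mean value:", statistics.fmean(values))
print("quartiles of the value distribution:", quartiles)
```

Because the value is a nonlinear function of `x`, the resulting distribution is generally not Gaussian even when the belief over `x` is simple, which is one motivation for representing the whole distribution rather than a few moments.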

Paper Structure (28 sections, 7 theorems, 32 equations, 13 figures, 1 algorithm)

Introduction
Our contribution.
Related work
Distributional RL.
Bayesian RL.
Mixed Approaches.
Uncertainty-Aware Policy Optimization.
Background & Notation
Markov Decision Processes
Return-Distributional Reinforcement Learning
Bayesian RL
The Value-Distributional Bellman Equation
Quantile-Regression for Value-Distribution Learning
Policy Optimization with Value Distributions
Posterior Dynamics.
...and 13 more sections

Key Result

Proposition 1

Let $V^\pi$ be the random value function defined in eq:random_value. Then, it holds that for any policy $\pi$ and initial state $s \in \mathcal{S}$.
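For orientation, a Bellman recursion of this kind for a random transition function $P$ drawn from the posterior would plausibly take the form below. The known reward function $r$, the discount factor $\gamma$, and the finite action and state sums are assumptions, since these symbols are not defined in this summary. This is a sketch of the expected form, not a quotation of the paper's equation.

```latex
V^\pi(s) = \sum_{a} \pi(a \mid s)\, r(s, a)
         + \gamma \sum_{a, s'} \pi(a \mid s)\, P(s' \mid s, a)\, V^\pi(s')
```

Under this reading, both $P$ and $V^\pi$ are random variables, so the right-hand side involves a product of dependent random quantities rather than a product of scalars.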

Figures (13)

Figure 1: Return and value distributions in Bayesian RL. (Left) MDP with uncertain transition probability from $s_0$ given by a random variable $X \in [0, 1]$. (Middle) Return distributions at $s_0$ for realizations of $X$, including the nominal dynamics (green). The return distribution captures the aleatoric noise under the sampled dynamics. (Right) Distribution of values at $s_0$. In the nominal case, the value $v(s_0)$ is a scalar obtained by averaging over the aleatoric uncertainty of the return distribution $Z(s_0)$ under the nominal dynamics. In our setting, $V(s_0)$ is a random variable due to the epistemic uncertainty about the MDP dynamics. Sampling from $V(s_0)$ is equivalent to first sampling $X = \tilde{x}$, then computing the conditional return distribution $Z(s_0) | X = \tilde{x}$, and finally averaging over the aleatoric noise.
Figure 2: Example value distribution. (Left) Uncertain MDP with a truncated Gaussian transition probability $X \sim \bar{\mathcal{N}}(\mu=0.4, \sigma=0.1)$ and a scalar (deterministic) $\beta \in [0, 1]$. For this example, we fixed $\beta=0.9$. (Middle) Distribution over MDPs, which corresponds directly to the distribution of $X$. (Right) Corresponding distribution of values for state $s_0$.
Figure 3: Visualization of the value-distributional Bellman backups, as prescribed by the per-state value-distributional Bellman equation. We identify four operations on distributions: infinite mixture over posterior transition functions (solid braces), shift by reward, scale by discount factor, and mixture over next states (broken-line braces). The main difference with respect to the return-distributional backup (Bellemare et al., 2023) is the presence of two distinct mixture operations.
Figure 4: Quantile-regression loss for the example MDP of Figure 2. (Left) Probability density of values for state $s_0$, with five quantile levels shown as colored vertical lines. (Right) The quantile-regression loss for the five quantile levels; the vertical lines mark the minimum of the color-matching loss. The vertical lines in both plots match up to numerical precision, meaning that following the gradient of this convex loss function would indeed converge to the quantile projection $\Pi_{w_1}\mu$.
Figure 5: Performance of quantile-regression for value-distribution learning in the example MDP of Figure 2. The parameter $\beta$ controls the covariance between $V(s_0)$ and $P(s_2 | s_0)$; the covariance increases with $\beta$ and is zero for $\beta=0$. (Top) Value distributions (Gaussian, bi-modal and heavy-tailed) generated by different prior distributions of the parameter $\delta$. (Middle) Evolution of the per-quantile estimation error $(\Pi_{w_1}\mu(s_0) - \mu_q(s_0))$ between the true quantile projection and the prediction; for $\beta=0$, our algorithm oscillates around the true quantile projection. (Bottom) $1$-Wasserstein metric between the true quantile projection and the estimate $\mu_q$ after $10^4$ gradient steps, as a function of the correlation parameter $\beta$. As $\beta$ moves from zero to one, the regression error increases and the algorithm no longer converges to the true quantiles, although the error remains relatively small.
...and 8 more figures
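The Figure 4 caption states that minimizing the convex quantile-regression loss by gradient steps recovers the quantiles of the value distribution. A minimal sketch of that mechanism follows, assuming the standard pinball form of the loss, five midpoint quantile levels, and samples from a truncated Gaussian standing in for draws from a value distribution. All names, the step size, and the sample sizes are illustrative assumptions rather than the paper's setup.

```python
import random


def truncated_gauss(rng, mu, sigma, lo=0.0, hi=1.0):
    """Rejection-sample a Gaussian restricted to the interval [lo, hi]."""
    while True:
        x = rng.gauss(mu, sigma)
        if lo <= x <= hi:
            return x


def quantile_loss(theta, samples, tau):
    """Average quantile-regression (pinball) loss of estimate theta at level tau."""
    total = 0.0
    for v in samples:
        diff = v - theta
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(samples)


def fit_quantile(samples, tau, steps=2000, lr=0.01):
    """Subgradient descent on the pinball loss, started from the sample mean."""
    theta = sum(samples) / len(samples)
    n = len(samples)
    for _ in range(steps):
        # Subgradient w.r.t. theta: fraction of samples below theta, minus tau.
        frac_below = sum(1 for v in samples if v < theta) / n
        theta -= lr * (frac_below - tau)
    return theta


rng = random.Random(0)
samples = [truncated_gauss(rng, mu=0.4, sigma=0.1) for _ in range(200)]
taus = [0.1, 0.3, 0.5, 0.7, 0.9]  # midpoints of five equal-probability bins
estimates = [fit_quantile(samples, tau) for tau in taus]
for tau, est in zip(taus, estimates):
    print(f"tau={tau:.1f}  estimate={est:.4f}")
```

Each estimate settles where the fraction of samples below it matches its level `tau`, which is the defining property of a quantile. In a learned critic the scalar `theta` would be replaced by a network output per quantile level.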

Theorems & Definitions (14)

Proposition 1: Random Variable Value-Distribution Bellman Equation
Definition 1
Definition 2
Lemma 1: Value-Distribution Bellman Equation
Definition 3
Theorem 1
Corollary 1
Example 1: Toy MDP
Example 2: Gridworld
Proposition 1: Random Variable Value-Distribution Bellman Equation
...and 4 more
