Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization

Carlos E. Luis; Alessandro G. Bottero; Julia Vinogradska; Felix Berkenkamp; Jan Peters

TL;DR

The paper introduces Q-Uncertainty Soft Actor-Critic (QU-SAC), a general-purpose policy optimization algorithm that can be applied to either risk-seeking or risk-averse policy optimization with minimal changes, and that demonstrates improved performance compared to other uncertainty estimation methods.

Abstract

We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over Markov decision processes (MDPs). Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation (UBE), but the over-approximation may result in inefficient exploration. We propose a new UBE whose solution converges to the true posterior variance over values and leads to lower regret in tabular exploration problems. We identify challenges in applying the UBE theory beyond tabular problems and propose a suitable approximation. Based on this approximation, we introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied for either risk-seeking or risk-averse policy optimization with minimal changes. Experiments in both online and offline RL demonstrate improved performance compared to other uncertainty estimation methods.
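As a rough illustration of how one objective could serve both risk modes, the sketch below assumes the risk-aware value takes a mean-plus-scaled-standard-deviation form, where the variance is an estimate of the epistemic variance of the value. That form, the function name `risk_aware_value`, and the parameter `lam` are assumptions made for illustration; they are not specified in the abstract.

```python
import numpy as np


def risk_aware_value(q_mean, variance, lam):
    """Hypothetical risk-aware value: mean + lam * sqrt(variance).

    lam > 0 rewards uncertainty (risk-seeking / optimistic),
    lam < 0 penalizes it (risk-averse / pessimistic).
    """
    q_mean = np.asarray(q_mean, dtype=float)
    variance = np.asarray(variance, dtype=float)
    # Clip to guard against small negative variance estimates.
    std = np.sqrt(np.clip(variance, 0.0, None))
    return q_mean + lam * std


# Two candidate actions: action 0 has a lower mean but high uncertainty,
# action 1 has a higher mean and no uncertainty (made-up numbers).
q_mean = np.array([1.0, 1.2])
variance = np.array([0.25, 0.0])

optimistic_choice = int(np.argmax(risk_aware_value(q_mean, variance, lam=1.0)))
pessimistic_choice = int(np.argmax(risk_aware_value(q_mean, variance, lam=-1.0)))
```

Under these made-up numbers, flipping the sign of `lam` is the only change needed to switch the preferred action from the uncertain one to the certain one.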

Paper Structure (50 sections, 10 theorems, 36 equations, 8 figures, 4 tables, 3 algorithms)

Introduction
Our contribution.
Related work
Model-free Bayesian RL.
Model-based Bayesian RL.
Online RL - Optimism.
Offline RL - Pessimism.
Unified offline / online RL.
Uncertainty in RL.
UBE-based RL.
Problem Statement
Uncertainty-Aware Policy Optimization
Tabular Problems
Practical bound.
Continuous Problems
...and 35 more sections

Key Result

Lemma 1

For any $s \in \mathcal{S}$ and any policy $\pi$, it holds that

Figures (8)

Figure 1: Architecture for $Q$-Uncertainty Soft Actor-Critic (QU-SAC). The dataset $\mathcal{D}$ may be either static, as in offline RL, or be dynamically populated with online interactions. This dataset is used to train an ensemble of dynamics models, which is then used for synthetic rollout generation. Each member of the ensemble populates its own buffer $\mathcal{D}_i$, which is used to train a corresponding ensemble of critics. Additionally, member-randomized rollouts are stored in $\mathcal{D}_{\text{model}}$ and used to train a $U$-net, which outputs an estimated epistemic variance of the value prediction. Lastly, the actor aims to optimize the risk-aware objective \ref{eq:policy_opt}, which combines the output of the critic ensemble and the $U$-net.
Figure 2: Illustrative example of uncertainty rewards. (Left) Ensemble of two value functions $\{Q_1, Q_2\}$. (Right) Corresponding mean value function $\bar{Q}$. The theory prescribes estimating the term in \ref{eq:pombu_rewards}, denoted $\hat{w}(s,a)$, which captures local variability of $\bar{Q}$ around $(s,a)$. Empirically, $\hat{w}(s,a)$ can be small despite large differences in individual members of the value ensemble, e.g., because $\bar{Q}$ is relatively flat around $(s,a)$. We propose the proxy uncertainty reward $\hat{w}_{\text{ub}}(s, a)$, which directly captures variability across the value ensemble and is less computationally expensive (no dynamics model forward pass).
Figure 3: Performance in the DeepSea benchmark. Lower values in plots indicate better performance. (Left) Learning time is measured as the first episode at which the sparse reward has been found in at least 10% of the episodes so far. (Right) Total regret is approximately equal to the number of episodes in which the sparse reward was not found. Results represent the average over 5 random seeds, and vertical bars on total regret indicate the standard error. Our variance estimate achieves the lowest regret and best scaling with problem size.
Figure 4: Total regret curve for the 7-room environment. Lower regret is better. Results are the average (solid lines) and standard error (shaded regions) over 10 random seeds. Our method achieves the lowest regret, significantly outperforming PSRL.
Figure 5: DeepMind Control Suite benchmark: smoothed learning curves over 500 episodes (500K environment steps). We report the mean (solid) and standard error (shaded region) over five random seeds. QU-SAC with the upper-bound variance estimate outperforms the baselines in 4/6 environments and has the best overall performance.
...and 3 more figures
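
The two DeepSea metrics described in the Figure 3 caption can be computed from a per-episode success log. The following is a minimal sketch, not the authors' evaluation code: the function names `learning_time` and `approx_total_regret`, the boolean list `found`, and the example log are hypothetical, and the regret function only reproduces the approximation stated in the caption (a count of episodes without the sparse reward).

```python
def learning_time(found, threshold=0.1):
    """First episode (1-indexed) at which the sparse reward has been found
    in at least `threshold` of the episodes so far; None if never reached."""
    successes = 0
    for episode, hit in enumerate(found, start=1):
        successes += bool(hit)
        if successes / episode >= threshold:
            return episode
    return None


def approx_total_regret(found):
    """Approximate total regret: number of episodes without the sparse reward."""
    return sum(1 for hit in found if not hit)


# Hypothetical log: 20 failed episodes followed by 5 successful ones.
found = [False] * 20 + [True] * 5
t_learn = learning_time(found)        # 3 successes out of 23 episodes crosses 10%
regret = approx_total_regret(found)   # 20 episodes without reward
```

A running success fraction is used here because the caption defines learning time relative to "episodes so far" rather than a fixed window.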

Theorems & Definitions (21)

Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
proof
Lemma 4
proof
Lemma 5
...and 11 more
