Trust the Model Where It Trusts Itself -- Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption

Bernd Frauenknecht; Artur Eisele; Devdutt Subhasish; Friedrich Solowjow; Sebastian Trimpe

TL;DR

This work tackles the data inefficiency of model-free RL by introducing MACURA, a model-based actor-critic that adapts model-based rollouts based on local uncertainty. It defines a trustworthy region $\mathcal{E}$ using a geometric Jensen-Shannon (GJS) divergence-based uncertainty measure $u_{GJS}$ and proves a monotonic-improvement bound when rollouts are confined to $\mathcal{E}$. The method uses an adaptive threshold $\kappa$ and a simple rollout horizon mechanism, with environment exploration (notably pink noise) to expand $\mathcal{E}$ over time. Empirical results on MuJoCo show MACURA delivers superior data efficiency and competitive or superior asymptotic performance compared to MBPO, M2AC, and SAC, while requiring less hyperparameter tuning.
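To make the GJS-based uncertainty measure more concrete, the following is a minimal sketch of one common form of the geometric Jensen-Shannon divergence between two diagonal Gaussians, averaged over all pairs of ensemble members. The function names, the diagonal-covariance assumption, the skew parameter `alpha`, and the pairwise averaging are illustrative assumptions, not necessarily the exact definition of $u_{GJS}$ used in the paper.

```python
import numpy as np


def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between diagonal Gaussians, summed over dimensions."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )


def geometric_mean_gauss(mu_p, var_p, mu_q, var_q, alpha=0.5):
    """Normalized weighted geometric mean of two diagonal Gaussians.

    The result is again Gaussian: precisions are interpolated linearly.
    """
    precision = (1.0 - alpha) / var_p + alpha / var_q
    var_g = 1.0 / precision
    mu_g = var_g * ((1.0 - alpha) * mu_p / var_p + alpha * mu_q / var_q)
    return mu_g, var_g


def gjs_diag_gauss(mu_p, var_p, mu_q, var_q, alpha=0.5):
    """Geometric Jensen-Shannon divergence (assumed form, skew alpha)."""
    mu_g, var_g = geometric_mean_gauss(mu_p, var_p, mu_q, var_q, alpha)
    return (1.0 - alpha) * kl_diag_gauss(mu_p, var_p, mu_g, var_g) + alpha * (
        kl_diag_gauss(mu_q, var_q, mu_g, var_g)
    )


def ensemble_gjs_uncertainty(means, variances, alpha=0.5):
    """Hypothetical uncertainty score: mean pairwise GJS over ensemble members.

    means, variances: arrays of shape (n_members, state_dim).
    """
    n = len(means)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += gjs_diag_gauss(
                means[i], variances[i], means[j], variances[j], alpha
            )
            pairs += 1
    return total / pairs
```

Under this sketch, the score is zero when all ensemble members predict the same Gaussian and grows as their predictions disagree, which is the behavior an uncertainty measure for defining $\mathcal{E}$ would need.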

Abstract

Dyna-style model-based reinforcement learning (MBRL) combines model-free agents with predictive transition models through model-based rollouts. This combination raises a critical question: 'When to trust your model?'; i.e., which rollout length results in the model providing useful data? Janner et al. (2019) address this question by gradually increasing rollout lengths throughout the training. While theoretically tempting, uniform model accuracy is a fallacy that collapses at the latest when extrapolating. Instead, we propose asking the question 'Where to trust your model?'. Using inherent model uncertainty to consider local accuracy, we obtain the Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption (MACURA) algorithm. We propose an easy-to-tune rollout mechanism and demonstrate substantial improvements in data efficiency and performance compared to state-of-the-art deep MBRL methods on the MuJoCo benchmark.
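The idea of asking where, rather than when, to trust the model can be illustrated with a small sketch of an uncertainty-gated rollout: a model-based rollout continues only while a local uncertainty score stays at or below a threshold. The names `gated_rollout`, `policy`, `model_step`, `uncertainty`, `kappa`, and `max_len` are hypothetical, and the toy dynamics are invented for illustration; this is not the authors' implementation.

```python
def gated_rollout(start_state, policy, model_step, uncertainty, kappa, max_len):
    """Roll out the learned model until uncertainty exceeds kappa.

    Returns the list of (state, action, reward, next_state) transitions
    collected while the model was considered trustworthy.
    """
    transitions = []
    state = start_state
    for _ in range(max_len):
        action = policy(state)
        # Stop as soon as the model no longer "trusts itself" at this point.
        if uncertainty(state, action) > kappa:
            break
        next_state, reward = model_step(state, action)
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions


# Toy 1-D example (all assumptions): data was collected near 0,
# so uncertainty is taken to grow with distance from the origin.
def toy_policy(state):
    return 0.5


def toy_model_step(state, action):
    return state + action, -abs(state)


def toy_uncertainty(state, action):
    return abs(state)


rollout = gated_rollout(0.0, toy_policy, toy_model_step, toy_uncertainty,
                        kappa=1.2, max_len=10)
```

In this toy setting the rollout is cut off once the state drifts away from the region covered by data, so the effective rollout length varies with the start state instead of following a fixed global schedule.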

Paper Structure (40 sections, 5 theorems, 62 equations, 22 figures, 8 tables, 4 algorithms)

Introduction
Background
Reinforcement Learning
Probabilistic Ensemble Models
Dyna-Style Model-Based Reinforcement Learning
Where to Trust your Model?
Monotonic Improvement under Dynamics Misalignment on $\mathcal{E}$
Formulation of Monotonic Improvement
Interpretation of the Result
Constructing $\mathcal{E}$ from Model Uncertainty
Defining $\mathcal{E}$ in Practice
Efficient Measure for Model Uncertainty
Illustrative Example
MACURA: Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption
Uncertainty-Based Rollout Adaption
...and 25 more sections

Key Result

Theorem 4.1

Suppose the expected return following policy $\pi$ under $\hat{\mathcal{M}}$ is denoted by $\eta[\pi]$ and $\tilde{\eta}[\pi]$ describes the expected return following $\pi$ under $\tilde{\mathcal{M}}$, then we can define a lower bound for $\eta[\pi]$ on $\mathcal{E} \subseteq \mathcal{S}$ of the form with

Figures (22)

Figure 1: Dyna-style MBRL. An agent with policy $\pi$ interacts with the environment $\mathcal{M}$. This data is stored in $\mathcal{D}_{\mathrm{env}}$ and used to train a dynamics model $\tilde{p}$ via supervised learning (SL). Model-based rollouts under $\pi$ are performed from start states $s_0$ in $\mathcal{D}_{\mathrm{env}}$ and stored in $\mathcal{D}_{\mathrm{mod}}$. The policy is trained on $\mathcal{D}_{\mathrm{mod}}$ via reinforcement learning (RL).
Figure 2: Where to trust your Model? $\mathcal{D}_{\mathrm{env}}$ induces a set of sufficient model accuracy $\mathcal{E} \subseteq \mathcal{S}$. A notion of $\mathcal{E}$ allows us to reason about whether rollouts are in a region of sufficient model accuracy. We use this reasoning to schedule rollout length.
Figure 3: Constructing $\mathcal{E}$ on a toy example. (a) Data to train the PE model. (b) Dynamics misalignment. (c) Proposed measure for model uncertainty $u_{GJS}$. (d) Set of sufficient model accuracy $\mathcal{E}$ to perform branched model-based rollouts.
Figure 4: Performance on the MuJoCo Benchmark. MACURA shows substantial improvements in data efficiency and asymptotic performance over state-of-the-art Dyna-style MBRL approaches (MBPO, M2AC) in most tasks. Most noticeably, MACURA is on par with or outperforms the asymptotic performance of the model-free SAC baseline.
Figure 5: Exploration Schemes on MACURA and MBPO. Impact of deterministic (D), white noise (WN), and pink noise (PN) exploration on algorithmic performance.
...and 17 more figures

Theorems & Definitions (10)

Theorem 4.1
proof
Lemma 2.1: Return mismatch with respect to state distribution shift
proof
Lemma 2.2: Recursive Formulation
proof
Lemma 2.3: Dependency on dynamics mismatch
proof
Theorem 2.4: Monotonic Improvement under Dynamics Misalignment on $\mathcal{E} \subseteq \mathcal{S}$
proof
