Algorithmic Framework for Model-based Deep Reinforcement Learning with Theoretical Guarantees

Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, Tengyu Ma

TL;DR

This paper tackles the theoretical gap in model-based deep RL by introducing a meta-algorithm that guarantees monotone improvement to a local optimum through a learned lower bound involving a discrepancy term. It formalizes discrepancy bounds, including norm-based and representation-invariant variants, and proves a telescoping decomposition to connect model error with value discrepancy. The authors instantiate a practical algorithm, SLBO, which alternates model fitting and policy optimization using a multi-step prediction loss, achieving state-of-the-art sample efficiency on continuous-control benchmarks with 1e6 samples or fewer. Empirical results, theoretical guarantees, and the proposed bounds provide a principled framework for design and analysis of non-linear model-based RL, with noted caveats and avenues for future work highlighted in a subsequent update.

Abstract

Model-based reinforcement learning (RL) is considered a promising approach to reducing the sample complexity that hinders model-free RL. However, the theoretical understanding of such methods has been rather limited. This paper introduces a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees. We design a meta-algorithm with a theoretical guarantee of monotone improvement to a local maximum of the expected reward. The meta-algorithm iteratively builds a lower bound of the expected reward based on the estimated dynamical model and sample trajectories, and then maximizes the lower bound jointly over the policy and the model. The framework extends the optimism-in-face-of-uncertainty principle to non-linear dynamical models in a way that requires no explicit uncertainty quantification. Instantiating our framework with simplifications gives a variant of model-based RL algorithms, Stochastic Lower Bounds Optimization (SLBO). Experiments demonstrate that SLBO achieves state-of-the-art performance when only one million or fewer samples are permitted on a range of continuous control benchmark tasks.
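
The alternation described in the abstract (build a lower bound, then maximize it jointly over the policy and the model) can be illustrated on a toy problem. The sketch below is not the paper's algorithm: the scalar "policy" and "model", the functions `value`, `discrepancy`, and `meta_algorithm`, and the grid search are all hypothetical, and the discrepancy is computed from the true model, which a practical method would have to estimate from sampled trajectories instead.

```python
# Toy illustration only: every name here is hypothetical, and the discrepancy
# below reads the true model directly, which a real algorithm could not do.

TRUE_THETA = 2.0  # parameter of the (normally unknown) true model M*


def value(action, theta):
    """Toy value V^{pi, M}: the policy is one scalar action, the model one scalar theta."""
    return -(action - theta) ** 2


def discrepancy(theta, action):
    """Oracle discrepancy D: upper-bounds V(model) - V(true) and is 0 at the true model."""
    return abs(value(action, theta) - value(action, TRUE_THETA))


def meta_algorithm(action0, actions, thetas, delta, iterations):
    """Repeatedly maximize the lower bound V - D over (policy, model) in a trust region."""
    policy = action0
    history = [value(policy, TRUE_THETA)]  # true value of each iterate
    for _ in range(iterations):
        candidates = [
            (value(a, th) - discrepancy(th, a), a, th)
            for a in actions
            if abs(a - policy) <= delta  # trust region: d(pi, pi_k) <= delta
            for th in thetas
        ]
        _, policy, _ = max(candidates)  # joint argmax over policy and model
        history.append(value(policy, TRUE_THETA))
    return policy, history


actions = [i * 0.25 for i in range(-8, 17)]  # candidate policies in [-2, 4]
thetas = [i * 0.5 for i in range(0, 9)]      # model class; contains TRUE_THETA
final_action, true_values = meta_algorithm(
    -2.0, actions, thetas, delta=0.5, iterations=12
)
print(final_action, true_values)
```

Because the model class contains the true model and the previous policy is always feasible, the true value of each iterate in this toy never decreases, which mirrors the monotone-improvement claim in a setting simple enough to check by hand.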

Paper Structure

This paper contains 35 sections, 17 theorems, 83 equations, 4 figures, 1 table, 3 algorithms.

Key Result

Theorem 3.1

Suppose that $M^\star\in \mathcal{M}$, that $D$ and $d$ satisfy equations eqn:base and eqn:equality, and that the optimization problem in equation eqn:obj is solvable at each iteration. Then Algorithm alg:framework produces a sequence of policies $\pi_0,\dots, \pi_T$ with monotonically increasing values: $V^{\pi_0, M^\star} \le V^{\pi_1, M^\star} \le \cdots \le V^{\pi_T, M^\star}$. Moreover, as $k\rightarrow \infty$, the value $V^{\pi_k, M^\star}$ converges to some $V^{\bar{\pi}, M^\star}$ ...
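
The conditions and the objective that the theorem cites by label are not expanded in this summary. As a hedged sketch of why monotone improvement can follow, assume they take the form below: a lower-bound condition on the discrepancy $D$ within a trust region defined by $d$, a condition that $D$ vanishes at the true model, and a constrained maximization of the lower bound. The symbols $\pi_{\mathrm{ref}}$, $\widehat{M}$, and $\delta$ are assumptions introduced here for illustration.

```latex
% Assumed form of the two conditions and of the update (not stated in this summary).
\begin{align*}
&\text{(lower bound)} & V^{\pi, M^\star} &\ge V^{\pi, \widehat{M}} - D_{\pi_{\mathrm{ref}}}(\widehat{M}, \pi)
  \quad \text{whenever } d(\pi, \pi_{\mathrm{ref}}) \le \delta, \\
&\text{(equality)} & D_{\pi_{\mathrm{ref}}}(M^\star, \pi) &= 0, \\
&\text{(update)} & (\pi_{k+1}, M_{k+1}) &\in \operatorname*{argmax}_{\pi,\; M \in \mathcal{M} \,:\, d(\pi, \pi_k) \le \delta}
  \; V^{\pi, M} - D_{\pi_k}(M, \pi).
\end{align*}
% Under these assumptions, monotonicity follows in three steps.
\begin{align*}
V^{\pi_{k+1}, M^\star}
  &\ge V^{\pi_{k+1}, M_{k+1}} - D_{\pi_k}(M_{k+1}, \pi_{k+1}) && \text{(lower bound)} \\
  &\ge V^{\pi_k, M^\star} - D_{\pi_k}(M^\star, \pi_k) && \text{(optimality; } (\pi_k, M^\star) \text{ is feasible)} \\
  &= V^{\pi_k, M^\star}. && \text{(equality)}
\end{align*}
```

The middle step is where the assumption $M^\star\in \mathcal{M}$ is used: the pair $(\pi_k, M^\star)$ is a feasible point of the maximization, so the maximizer's objective value is at least as large.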

Figures (4)

  • Figure 1: Comparison between SLBO (ours), SLBO with squared $\ell^2$ model loss (SLBO-MSE), vanilla model-based TRPO (MB-TRPO), model-free TRPO (MF-TRPO), and Soft Actor-Critic (SAC). We average the results over 10 different random seeds, where the solid lines indicate the mean and shaded areas indicate one standard deviation. The dotted reference lines are the total rewards of MF-TRPO after 8 million steps.
  • Figure 2: Ablation study on multi-step model training. All the experiments are averaged over 10 random seeds. The x-axis shows the total amount of real samples from the environment. The y-axis shows the averaged return from execution of our learned policy. The solid line is the mean of the total rewards from each seed. The shaded area is one standard deviation.
  • Figure 3: Ablation study on entropy regularization. $\lambda$ is the coefficient of entropy regularization in the TRPO objective. All the experiments are averaged over 10 random seeds. The x-axis shows the total amount of real samples from the environment. The y-axis shows the averaged return from execution of our learned policy. The solid line is the mean of the total rewards from each seed. The shaded area is one standard deviation.
  • Figure 4: Comparison among SLBO (ours), SLBO with squared $\ell^2$ model loss (SLBO-MSE), vanilla model-based TRPO (MB-TRPO), model-free TRPO (MF-TRPO), and Soft Actor-Critic (SAC) with more samples than in Figure 1. SLBO, SAC, MF-TRPO are trained with 4 million real samples. We average the results over 10 different random seeds, where the solid lines indicate the mean and shaded areas indicate one standard deviation. The dotted reference lines are the total rewards of MF-TRPO after 8 million steps.
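
The captions contrast SLBO's model loss with a squared $\ell^2$ variant (SLBO-MSE) and ablate multi-step model training. The sketch below shows one simple way such a multi-step loss can be computed: the model is rolled forward on its own predictions and the per-step $\ell^2$ error is averaged. It is a simplified illustration, not the paper's exact loss; `multi_step_loss`, `l2_norm`, and the toy dynamics are hypothetical names chosen for this example.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def multi_step_loss(model, states, actions, squared=False):
    """Average open-loop prediction error over H = len(actions) steps.

    states holds s_t, ..., s_{t+H} (H + 1 vectors); actions holds a_t, ..., a_{t+H-1}.
    With squared=True the per-step error is the squared l2 norm (an MSE-style loss).
    """
    horizon = len(actions)
    predicted = states[0]
    total = 0.0
    for i in range(horizon):
        # Feed the model its own previous prediction, so errors can compound.
        predicted = model(predicted, actions[i])
        error = l2_norm([p - s for p, s in zip(predicted, states[i + 1])])
        total += error ** 2 if squared else error
    return total / horizon


def exact_model(state, action):
    """Toy dynamics s' = s + a, matching the trajectory below."""
    return [s + a for s, a in zip(state, action)]


def biased_model(state, action):
    """Same dynamics plus a constant drift of 3 in the second coordinate."""
    nxt = exact_model(state, action)
    return [nxt[0], nxt[1] + 3.0]


states = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
actions = [[1.0, 0.0], [1.0, 0.0]]

print(multi_step_loss(exact_model, states, actions))                 # 0.0
print(multi_step_loss(biased_model, states, actions))                # 4.5
print(multi_step_loss(biased_model, states, actions, squared=True))  # 22.5
```

In this toy the drift accumulates across the rollout (errors of 3, then 6), which a one-step loss would not expose, and the squared variant weights the larger second-step error more heavily than the norm-based one.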

Theorems & Definitions (41)

  • Definition 2.1
  • Theorem 3.1
  • Proof of Theorem 3.1
  • Lemma 4.1
  • Proposition 4.2
  • Lemma 4.3
  • Proposition 4.4
  • Remark 4.5
  • Definition A.1
  • Proposition A.2
  • ...and 31 more