SOMBRL: Scalable and Optimistic Model-Based RL

Bhavya Sukhija, Lenart Treven, Carmelo Sferrazza, Florian Dörfler, Pieter Abbeel, Andreas Krause

TL;DR

SOMBRL tackles sample-efficient exploration in model-based RL with unknown dynamics by learning an uncertainty-aware model and optimizing a blended objective that adds an intrinsic reward proportional to epistemic uncertainty: $J_n(\pi)=\mathbb{E}_{\pi}[\sum_{t=0}^{T-1}(r(\bm{x}'_t,\bm{u}_t)+\lambda_n\|\bm{\sigma}_n(\bm{x}'_t,\bm{u}_t)\|)]$, where the planned states evolve under the learned mean model, $\bm{x}'_{t+1}=\bm{\mu}_n(\bm{x}'_t,\bm{u}_t)+\bm{w}_t$, and $\bm{\sigma}_n$ is the model's epistemic uncertainty. Under GP/RKHS assumptions, the authors prove that this objective yields optimistic estimates of the true value, which gives sublinear regret in the finite-horizon, $\gamma$-discounted infinite-horizon, and nonepisodic settings. The approach is deliberately simple and scalable because it avoids explicit optimization over the set of plausible dynamics. It is demonstrated on state-based and visual-control benchmarks as well as on hardware, where it outperforms strong baselines and is more sample-efficient. The framework offers a practical, principled route to exploration in high-dimensional MBRL, with potential extensions to safe and offline RL.
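To make the blended objective concrete, the following is a minimal NumPy sketch of a single-rollout estimate of $J_n(\pi)$. It is not the authors' implementation: the function name `optimistic_return` and the toy policy, mean model, uncertainty, and reward are all hypothetical stand-ins chosen for illustration, and the expectation is approximated by one trajectory (exact only when the noise is set to zero).

```python
import numpy as np


def optimistic_return(policy, mean_model, sigma_model, reward_fn,
                      x0, horizon, lam, noise_std=0.0, seed=0):
    """Single-rollout estimate of J_n(pi): reward plus lam * ||sigma_n||.

    The state is propagated with the learned mean model mu_n plus
    Gaussian noise w_t, mirroring x'_{t+1} = mu_n(x'_t, u_t) + w_t.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    total = 0.0
    for _ in range(horizon):
        u = policy(x)
        # Intrinsic reward: norm of the epistemic uncertainty at (x, u).
        bonus = lam * np.linalg.norm(sigma_model(x, u))
        total += reward_fn(x, u) + bonus
        x = mean_model(x, u) + noise_std * rng.standard_normal(x.shape)
    return float(total)


# Hypothetical toy problem (assumptions, not taken from the paper).
def toy_policy(x):
    return -0.5 * x


def toy_mean(x, u):
    return 0.9 * x + u


def toy_sigma(x, u):
    return 0.1 * np.ones_like(x)  # constant uncertainty for simplicity


def toy_reward(x, u):
    return -float(x @ x) - 0.1 * float(u @ u)


x0 = np.array([1.0])
greedy = optimistic_return(toy_policy, toy_mean, toy_sigma, toy_reward,
                           x0, horizon=10, lam=0.0)
optimistic = optimistic_return(toy_policy, toy_mean, toy_sigma, toy_reward,
                               x0, horizon=10, lam=2.0)
```

Setting `lam=0.0` recovers the greedy objective. Because the bonus is added to the reward rather than obtained by searching over candidate dynamics, any policy optimizer or planner that can evaluate rollouts of the mean model could in principle be reused unchanged.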

Abstract

We address the challenge of efficient exploration in model-based reinforcement learning (MBRL), where the system dynamics are unknown and the RL agent must learn directly from online interactions. We propose Scalable and Optimistic MBRL (SOMBRL), an approach based on the principle of optimism in the face of uncertainty. SOMBRL learns an uncertainty-aware dynamics model and greedily maximizes a weighted sum of the extrinsic reward and the agent's epistemic uncertainty. SOMBRL is compatible with any policy optimizer or planner, and under common regularity assumptions on the system, we show that SOMBRL has sublinear regret for nonlinear dynamics in the (i) finite-horizon, (ii) discounted infinite-horizon, and (iii) non-episodic settings. Additionally, SOMBRL offers a flexible and scalable solution for principled exploration. We evaluate SOMBRL on state-based and visual-control environments, where it performs strongly across all tasks and against all baselines. We also evaluate SOMBRL on a highly dynamic RC car in hardware and show that it outperforms the state of the art, illustrating the benefits of principled exploration for MBRL.
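As an illustration of what an "uncertainty-aware dynamics model" can provide, the sketch below computes a Gaussian-process posterior mean and standard deviation with an RBF kernel, using the standard GP regression formulas. The kernel choice, hyperparameters, function names, and toy data are assumptions made for this example; the paper's actual models may differ. The returned `sigma` plays the role of the epistemic uncertainty that enters the intrinsic reward.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel between rows of a (n, d) and b (m, d)."""
    d2 = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale**2)


def gp_posterior(z_train, y_train, z_query, noise_std=0.1, lengthscale=1.0):
    """Posterior mean and standard deviation of a zero-mean GP.

    z_* stack state-action inputs row-wise; y_train holds one output
    dimension of the observed next state.
    """
    n = len(z_train)
    K = rbf_kernel(z_train, z_train, lengthscale) + noise_std**2 * np.eye(n)
    K_s = rbf_kernel(z_train, z_query, lengthscale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v**2, axis=0)  # prior variance k(z, z) = 1 for RBF
    sigma = np.sqrt(np.maximum(var, 0.0))
    return mu, sigma


# Hypothetical 1-D data: inputs in [-1, 1], targets from a smooth function.
z_train = np.linspace(-1.0, 1.0, 10)[:, None]
y_train = np.sin(3.0 * z_train[:, 0])
z_query = np.array([[0.0], [5.0]])  # one visited region, one unvisited
mu, sigma = gp_posterior(z_train, y_train, z_query)
```

In this toy setup `sigma` is small at the query inside the data range and close to the prior value of 1 far away from it, so an uncertainty bonus would steer the agent toward the unvisited region.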

Paper Structure

This paper contains 39 sections, 17 theorems, 72 equations, 11 figures, 2 tables.

Key Result

Lemma 5.3

Let assumptions ass:lipschitz_continuity and ass:rkhs_func hold. Then there exists a $\lambda_n \in \Theta(\sqrt{\Gamma_N})$ such that, for all $n > 0$ and $\bm{\pi} \in \Pi$, with probability at least $1-\delta$, $J(\bm{\pi}) \leq J_n(\bm{\pi})$. Moreover, $J(\bm{\pi}^*) \leq J_n(\bm{\pi}_n)$.
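
The second inequality follows from the first in one step, assuming (as the abstract's "greedily maximizes" suggests) that $\bm{\pi}_n$ denotes a maximizer of $J_n$ over $\Pi$ and that $\bm{\pi}^* \in \Pi$: apply the first inequality to $\bm{\pi}^*$, then use the optimality of $\bm{\pi}_n$ for $J_n$.

$$
J(\bm{\pi}^*) \;\leq\; J_n(\bm{\pi}^*) \;\leq\; \max_{\bm{\pi} \in \Pi} J_n(\bm{\pi}) \;=\; J_n(\bm{\pi}_n)
$$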

Figures (11)

  • Figure 1: Top: We showcase scalability of SOMBRL on visual control tasks from DMC and Atari. Bottom: We evaluate SOMBRL on a highly dynamic RC car where we learn to perform a complex parking maneuver in only 20 real-world episodes.
  • Figure 2: Left: Learning curves for the nonepisodic setting with GP dynamics. We report the average reward $J_{avg}(\bm{\pi}_N)$ and regret $R_N$. The curves are reported with 5 seeds, and we plot the median return with its standard deviation. Right: Learning curves for the episodic setting with GP dynamics. We report the median episode reward $J(\bm{\pi}_N)$ over an episode with 5 seeds and its standard deviation.
  • Figure 3: Left: Learning curves for the state-based tasks from DMC using MBPO as the base algorithm. Across all experiments, MBPO-Optimistic obtains the best performance compared to its greedy variants. MBPO-Optimistic also scales to high-dimensional tasks, specifically the humanoid environments from DMC. Right: Learning curves for the visual control tasks from DMC and Atari using Dreamer as the base algorithm. Dreamer-Optimistic performs on par with or better than Dreamer in all our experiments. The gain is particularly pronounced in the Venture task from the Atari benchmark, where Dreamer fails to obtain any rewards.
  • Figure 4: Left: Learning curves with action costs, where we compare Dreamer with Dreamer-Optimistic. Dreamer fails to explore sufficiently with action costs, whereas Dreamer-Optimistic is able to explore and obtain much higher performance. Right: Learning curves for our experiments with SimFSVGD. Top row: Starting from the dense reward of rothfuss2024bridging, we change the parameters of the reward function to make it sparse. We observe that, as the reward gets sparser, SimFSVGD drops in performance and SimFSVGD-Optimistic outperforms it. Bottom row: We run the sparse reward configuration on hardware (depicted on the right side at the bottom), where we obtain similar results. As opposed to SimFSVGD-Optimistic, SimFSVGD fails to solve the task.
  • Figure 5: Solution to the Lipschitz function optimization problem (eq: Lipschitz function optimization) for different values of $B$. For larger values of $B$, $\bm{\mu}_n$ and ${\bm{f}}_n$ effectively coincide.
  • ...and 6 more figures

Theorems & Definitions (33)

  • Lemma 5.3
  • Theorem 5.4: Finite horizon setting
  • Theorem 5.5: $\gamma$-discounted, infinite horizon setting
  • Theorem 5.6: Informal statement; nonepisodic average reward case
  • Definition B.1: Well-calibrated statistical model of ${\bm{f}}^*$, rothfuss2023hallucinated
  • Lemma B.2: Well calibrated confidence intervals for RKHS, rothfuss2023hallucinated
  • Lemma B.3
  • proof
  • Lemma B.4
  • proof
  • ...and 23 more