Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning

Akshay Mete, Shahid Aamir Sheikh, Tzu-Hsiang Lin, Dileep Kalathil, P. R. Kumar

TL;DR

The paper tackles the persistent exploration challenge in sparse-reward model-based RL by introducing Optimistic World Models (OWMs), which integrate optimistic dynamics directly into world-model learning via a gradient-based RBMLE framework. By augmenting the standard world-model objective with an optimistic dynamics loss and an entropy term, OWMs bias imagined transitions toward higher rewards while retaining scalability and minimal architectural changes. The authors instantiate this framework in two architectures, Optimistic DreamerV3 and Optimistic STORM, achieving substantial gains in sample efficiency and cumulative returns on Atari100K and DeepMind Control benchmarks, especially in sparse-reward settings. The approach emphasizes a plug-and-play, gradient-based optimization path that avoids uncertainty estimates and constrained optimization, with ablations confirming the benefits of modest optimism and entropy regularization. Overall, OWMs offer a practical, scalable route to more efficient exploration in deep model-based RL and set the stage for further theoretical and empirical refinements of RBMLE-inspired methods in large-scale settings.

Abstract

Efficient exploration remains a central challenge in reinforcement learning (RL), particularly in sparse-reward environments. We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration that brings classical reward-biased maximum likelihood estimation (RBMLE) from adaptive control into deep RL. In contrast to upper confidence bound (UCB)-style exploration methods, OWMs incorporate optimism directly into model learning by augmentation with an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes. This fully gradient-based loss requires neither uncertainty estimates nor constrained optimization. Our approach is plug-and-play with existing world model frameworks, preserving scalability while requiring only minimal modifications to standard training procedures. We instantiate OWMs within two state-of-the-art world model architectures, leading to Optimistic DreamerV3 and Optimistic STORM, which demonstrate significant improvements in sample efficiency and cumulative return compared to their baseline counterparts.
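To make the idea of a reward-biased model loss concrete, here is a minimal toy sketch. It is not the authors' loss: the tabular setting, the function name `fit_transition_model`, and the exact combination "negative log-likelihood minus `alpha` times expected reward minus `eta` times entropy" are all assumptions chosen for illustration, loosely following the RBMLE idea of adding a reward bias to the likelihood. The symbols `alpha` and `eta` mirror the optimism and entropy coefficients named in the figure captions.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def fit_transition_model(counts, rewards, alpha=0.0, eta=0.0, lr=0.5, steps=2000):
    """Fit categorical next-state probabilities by gradient descent.

    Hypothetical toy objective (an assumption, not the paper's loss):
        NLL(data) - alpha * E_p[reward] - eta * H(p)
    where p is the model's next-state distribution.
    """
    counts = np.asarray(counts, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    freq = counts / counts.sum()          # empirical next-state frequencies
    logits = np.zeros_like(freq)
    for _ in range(steps):
        p = softmax(logits)
        # Gradient of the mean negative log-likelihood w.r.t. the logits.
        grad_nll = p - freq
        # Gradient of the expected reward E_p[r] w.r.t. the logits.
        grad_reward = p * (rewards - p @ rewards)
        # Gradient of the entropy H(p) = -sum p log p w.r.t. the logits.
        log_p = np.log(p + 1e-12)
        entropy = -(p * log_p).sum()
        grad_entropy = -p * (log_p + entropy)
        # Descend on NLL while ascending on reward and entropy.
        logits -= lr * (grad_nll - alpha * grad_reward - eta * grad_entropy)
    return softmax(logits)


# Next state 1 was observed rarely but carries the only reward.
counts = [8, 2]
rewards = [0.0, 1.0]
mle = fit_transition_model(counts, rewards)                    # plain likelihood fit
optimistic = fit_transition_model(counts, rewards, alpha=0.05)  # mildly reward-biased
print(mle, optimistic)
```

With `alpha=0` the fit recovers the empirical frequencies; a small positive `alpha` shifts some probability mass toward the rewarding next state while staying close to the data, which is the qualitative effect that "mild optimism" refers to.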

Paper Structure

This paper contains 20 sections, 19 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Optimistic World Models on challenging environments.
  • Figure 2: Optimistic World Model framework: In standard world models, the dynamics model is trained only to fit the real replay buffer. In OWMs, the optimistic dynamics loss uses imagined trajectories to push model transitions toward high-reward outcomes (highlighted by the red arrow), leading to optimistic imaginations.
  • Figure 3: Optimistic World Models on sparse reward environments.
  • Figure 4: Performance of Optimistic World Models: percentage gain of Optimistic DreamerV3 and Optimistic STORM over DreamerV3 and STORM, respectively. Freeway is omitted from the O-STORM Atari bar chart because the baseline STORM scores $0$ there, while O-STORM achieves a mean score of $6.38$.
  • Figure 5: Ablations and hyperparameter sensitivity results for Cartpole Swingup Sparse (DMC Proprio): (a) Sensitivity to the optimism term $\alpha$: poor performance at $\alpha=0.1$ shows that optimism should be mild. (b) Sensitivity to the model entropy loss coefficient $\eta$: no learning at $\eta=0.03$, while smaller values are beneficial. (c) Performance without the entropy loss: the entropy loss improves the performance of O-DreamerV3. (d) Performance under various decay schedules of $\alpha(t)$. Ablations for additional environments are provided in the Appendix.
  • ...and 7 more figures