Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning
Akshay Mete, Shahid Aamir Sheikh, Tzu-Hsiang Lin, Dileep Kalathil, P. R. Kumar
TL;DR
The paper tackles the persistent exploration challenge in sparse-reward model-based RL by introducing Optimistic World Models (OWMs), which build optimism directly into world-model learning through a gradient-based form of reward-biased maximum likelihood estimation (RBMLE). OWMs augment the standard world-model objective with an optimistic dynamics loss and an entropy term, biasing imagined transitions toward higher rewards while preserving scalability and requiring only minimal architectural changes. The authors instantiate the framework in two architectures, Optimistic DreamerV3 and Optimistic STORM, and report substantial gains in sample efficiency and cumulative return on the Atari100K and DeepMind Control benchmarks, especially in sparse-reward settings. The approach is plug-and-play and fully gradient-based, so it needs neither uncertainty estimates nor constrained optimization; ablations confirm the benefits of modest optimism and entropy regularization. Overall, OWMs offer a practical, scalable route to more efficient exploration in deep model-based RL and motivate further theoretical and empirical work on RBMLE-inspired methods at scale.
Abstract
Efficient exploration remains a central challenge in reinforcement learning (RL), particularly in sparse-reward environments. We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration that brings classical reward-biased maximum likelihood estimation (RBMLE) from adaptive control into deep RL. In contrast to upper confidence bound (UCB)-style exploration methods, OWMs incorporate optimism directly into model learning by augmenting the world-model objective with an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes. This fully gradient-based loss requires neither uncertainty estimates nor constrained optimization. Our approach is plug-and-play with existing world model frameworks, preserving scalability while requiring only minimal modifications to standard training procedures. We instantiate OWMs within two state-of-the-art world model architectures, yielding Optimistic DreamerV3 and Optimistic STORM, which demonstrate significant improvements in sample efficiency and cumulative return compared to their baseline counterparts.
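To make the idea of a reward-biased model loss concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: it assumes a tabular categorical next-state model for a single state-action pair, and the function name `optimistic_model_loss`, the weights `alpha` and `beta`, and the sign convention for the entropy term (treated here as a bonus) are all hypothetical choices made for illustration. The sketch only shows how a likelihood term, a reward-seeking term, and an entropy term might be combined into one differentiable objective.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def optimistic_model_loss(logits, observed_next, reward_per_state,
                          alpha=0.1, beta=0.01):
    """Toy reward-biased loss for one (state, action) pair.

    Assumptions (illustrative, not from the paper):
      - the model is a categorical distribution over next states,
      - alpha weights the optimism (expected imagined reward) term,
      - beta weights an entropy bonus on the predicted distribution.
    """
    probs = softmax(logits)
    # Standard model-fitting term: negative log-likelihood of the observed transition.
    nll = -np.log(probs[observed_next])
    # Optimism term: expected reward of the next state under the model's own prediction.
    expected_reward = float(probs @ reward_per_state)
    # Entropy of the predicted next-state distribution.
    entropy = -float(np.sum(probs * np.log(probs)))
    # Lower loss for models that fit the data AND predict rewarding outcomes.
    return nll - alpha * expected_reward - beta * entropy


if __name__ == "__main__":
    logits = np.zeros(3)                  # uniform prediction over 3 next states
    rewards = np.array([0.0, 0.0, 1.0])   # only the last state is rewarding
    print(optimistic_model_loss(logits, 0, rewards, alpha=0.0, beta=0.0))
    print(optimistic_model_loss(logits, 0, rewards, alpha=0.3, beta=0.0))
```

With `alpha = 0` and `beta = 0` the loss reduces to the plain negative log-likelihood, so the optimism weight controls how far the learned model is pulled away from the maximum-likelihood fit. In a deep world model the same combination would be applied to a learned latent dynamics network and minimized by gradient descent, but those details are beyond what this sketch covers.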
