Enter the Void - Planning to Seek Entropy When Reward is Scarce

Ashish Sundar, Chunbo Luo, Xiaoyang Wang

TL;DR

Sparse rewards and partial observability hinder exploration in real-world RL. The authors propose an inference-time entropy-seeking planner that uses DreamerV3's world model to anticipate informative latent states and steer the agent toward high-entropy futures. A lightweight PPO-based meta-planner decides when to commit to a planned rollout and when to replan, which keeps plans short and adaptive and reduces dithering. Across MiniWorld Maze, Crafter, and DMC-Vision, the approach improves sample efficiency, converging faster and reaching higher final returns on several tasks.

Abstract

Model-based reinforcement learning (MBRL) offers an intuitive way to increase the sample efficiency of model-free RL methods by simultaneously training a world model that learns to predict the future. These models account for the large majority of training compute and time, and they are subsequently used to train actors entirely in simulation, but once this is done they are quickly discarded. We show in this work that utilising these models at inference time can significantly boost sample efficiency. We propose a novel approach that anticipates and actively seeks out informative states using the world model's short-horizon latent predictions, offering a principled alternative to traditional curiosity-driven methods that chase outdated estimates of high-uncertainty states. While many model predictive control (MPC) based methods offer similar alternatives, they typically lack commitment, synthesising multiple multi-step plans at every step. To mitigate this, we present a hierarchical planner that dynamically decides when to replan, how long the planning horizon should be, and how strongly to commit to seeking entropy. While our method can theoretically be applied to any model that trains its own actors solely on model-generated data, we have applied it to Dreamer to illustrate the concept. Our method finishes MiniWorld's procedurally generated mazes 50% faster than base Dreamer at convergence, and converges in only 60% of the environment steps that base Dreamer's policy needs; it displays reasoned exploratory behaviour in Crafter and achieves the same reward as base Dreamer in a third of the steps; and planning tends to improve sample efficiency on DeepMind Control tasks.
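The idea of planning toward high-entropy futures and then committing to the plan can be illustrated with a small sketch. It is not the authors' implementation: `ToyWorldModel`, `plan`, and `run_with_commitment` are hypothetical names, the world model is a random categorical transition table rather than Dreamer's latent dynamics, candidate plans are sampled uniformly rather than from a trained actor, and the learned PPO meta-planner is replaced by a fixed rule that replans only once the current plan is exhausted.

```python
import numpy as np

def categorical_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution on the last axis."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

class ToyWorldModel:
    """Hypothetical stand-in for a learned latent dynamics model."""
    def __init__(self, n_states=6, n_actions=3, seed=0):
        rng = np.random.default_rng(seed)
        logits = 2.0 * rng.normal(size=(n_actions, n_states, n_states))
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        self.trans = e / e.sum(axis=-1, keepdims=True)  # rows sum to 1
        self.n_actions = n_actions

    def step(self, belief, action):
        """Predict the next latent-state distribution for one action."""
        return belief @ self.trans[action]

def plan(model, belief, horizon, n_candidates, rng):
    """Sample action sequences and keep the one with the highest summed predicted entropy."""
    best_score, best_plan = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.integers(model.n_actions, size=horizon)
        b, score = belief, 0.0
        for a in actions:
            b = model.step(b, a)
            score += categorical_entropy(b)
        if score > best_score:
            best_score, best_plan = score, actions
    return best_plan, best_score

def run_with_commitment(model, belief, steps, horizon=4, n_candidates=16, seed=0):
    """Execute plans to completion; replan only when the action queue is empty.

    The emptiness check is an assumed placeholder for the meta-planner's decision.
    """
    rng = np.random.default_rng(seed)
    queue, replans, executed = [], 0, []
    for _ in range(steps):
        if not queue:
            actions, _ = plan(model, belief, horizon, n_candidates, rng)
            queue = [int(a) for a in actions]
            replans += 1
        a = queue.pop(0)
        belief = model.step(belief, a)
        executed.append(a)
    return executed, replans
```

Under these assumptions, running 12 steps with a horizon of 4 triggers 3 planning calls, whereas an MPC-style loop that replans at every step would trigger 12. A learned meta-planner would replace the `if not queue` check with a policy that can also cut a plan short or change the horizon.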

Paper Structure

This paper contains 39 sections, 13 equations, 14 figures, 5 tables, 1 algorithm.

Figures (14)

  • Figure 1: Episode lengths across different porosity levels. Lower porosity increases maze difficulty.
  • Figure 2: DMC-Vision learning curves (return vs. environment steps) for no-plan, the planning variant, and Plan2Explore. Shaded bands: $\pm$2 standard error of the mean (SEM) across seeds.
  • Figure 3: Episode returns during Crafter training.
  • Figure 4: Comparison of episode lengths during training for ObjectNav (left) and Crafter (right) across different ablations.
  • Figure 5: Entropy/reward ablations for Crafter (left) and DMC (right). We compare training the meta-policy with entropy-only, reward-only, and a 50/50 mixture of both. In both domains, entropy-only and reward-only training of the meta-policy are comparable and outperform no-planner, suggesting robustness to the precise meta-reward weighting.
  • ...and 9 more figures
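The shaded bands described for Figure 2 ($\pm$2 SEM across seeds) can be computed as in the sketch below. The function name `sem_band` and the `(n_seeds, n_points)` array layout are assumptions for illustration, as is the use of the sample standard deviation (`ddof=1`); the source does not state which estimator was used.

```python
import numpy as np

def sem_band(returns, k=2.0):
    """Mean curve and +/- k * SEM band across seeds.

    returns: array-like of shape (n_seeds, n_points), one learning curve per seed.
    """
    returns = np.asarray(returns, dtype=float)
    n_seeds = returns.shape[0]
    mean = returns.mean(axis=0)
    # Assumed estimator: sample standard deviation divided by sqrt(number of seeds).
    sem = returns.std(axis=0, ddof=1) / np.sqrt(n_seeds)
    return mean, mean - k * sem, mean + k * sem

# Three seeds, two evaluation points each.
mean, lower, upper = sem_band([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

With only a few seeds the band is a rough indication of variability rather than a calibrated confidence interval.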