Enter the Void - Planning to Seek Entropy When Reward is Scarce
Ashish Sundar, Chunbo Luo, Xiaoyang Wang
TL;DR
Sparse rewards and partial observability hinder exploration in real-world RL. The authors propose an inference-time entropy-seeking planner that uses DreamerV3's world model to anticipate informative latent states and actively seek high-entropy futures. A lightweight PPO-based meta-planner governs when to commit to a planned rollout, enabling short, adaptive replanning and reducing dithering. Across MiniWorld Maze, Crafter, and DMC-Vision, this approach improves sample efficiency, converging faster and reaching higher final returns on several tasks.
Abstract
Model-based reinforcement learning (MBRL) offers an intuitive way to increase the sample efficiency of model-free RL methods by simultaneously training a world model that learns to predict the future. These world models account for the large majority of training compute and time, and they are used to train actors entirely in simulation, but once training is done they are quickly discarded. We show in this work that utilising these models at inference time can significantly boost sample efficiency. We propose a novel approach that anticipates and actively seeks out informative states using the world model's short-horizon latent predictions, offering a principled alternative to traditional curiosity-driven methods that chase outdated estimates of high-uncertainty states. While many model predictive control (MPC) based methods offer similar alternatives, they typically lack commitment, synthesising multiple multi-step plans at every step. To mitigate this, we present a hierarchical planner that dynamically decides when to replan, how long the planning horizon should be, and how much to commit to seeking entropy. While our method can in principle be applied to any model that trains its own actors solely on model-generated data, we apply it to Dreamer to illustrate the concept. At convergence, our method finishes MiniWorld's procedurally generated mazes 50% faster than base Dreamer, and it needs only 60% of the environment steps that base Dreamer's policy requires. In Crafter, it displays reasoned exploratory behaviour and achieves the same reward as base Dreamer in a third of the steps. On DeepMind Control tasks, planning tends to improve sample efficiency.
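The idea of planning toward high-entropy latent futures and committing to part of a plan can be illustrated with a toy sketch. This is not the authors' implementation. `ToyWorldModel` is a hypothetical stand-in for a learned latent dynamics model. The random-shooting search and the fixed `commit` length are assumptions that replace the paper's planner and its PPO-based meta-planner, which learns when to replan. All names and parameters below are illustrative.

```python
import numpy as np


def categorical_entropy(probs, eps=1e-8):
    """Shannon entropy (in nats) of a categorical distribution on the last axis."""
    probs = np.clip(probs, eps, 1.0)
    return -np.sum(probs * np.log(probs), axis=-1)


class ToyWorldModel:
    """Hypothetical stand-in for a learned latent dynamics model (an assumption)."""

    def __init__(self, n_states=5, n_actions=3, seed=0):
        rng = np.random.default_rng(seed)
        logits = rng.normal(size=(n_actions, n_states, n_states))
        # transition[a, s] is a categorical distribution over the next latent state.
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        self.transition = e / e.sum(axis=-1, keepdims=True)
        self.n_states = n_states
        self.n_actions = n_actions

    def step(self, belief, action):
        """Propagate a belief over latent states one step under `action`."""
        return belief @ self.transition[action]


def plan_entropy_seeking(model, belief, horizon, n_candidates, rng):
    """Random-shooting search: keep the action sequence whose imagined
    rollout has the highest summed predictive entropy."""
    best_score, best_plan = -np.inf, None
    for _ in range(n_candidates):
        plan = rng.integers(model.n_actions, size=horizon)
        b, score = belief, 0.0
        for a in plan:
            b = model.step(b, a)
            score += float(categorical_entropy(b))
        if score > best_score:
            best_score, best_plan = score, plan
    return best_plan, best_score


def act_with_commitment(model, belief, n_steps, horizon=4, commit=2, seed=0):
    """Execute `commit` actions of each plan before replanning.
    A fixed commit length is an assumption; the paper learns this decision."""
    rng = np.random.default_rng(seed)
    actions, queue = [], []
    for _ in range(n_steps):
        if not queue:  # replan only once the committed prefix is used up
            plan, _ = plan_entropy_seeking(model, belief, horizon, 16, rng)
            queue = [int(a) for a in plan[:commit]]
        a = queue.pop(0)
        belief = model.step(belief, a)
        actions.append(a)
    return actions


model = ToyWorldModel()
start = np.eye(model.n_states)[0]  # belief concentrated on latent state 0
print(act_with_commitment(model, start, n_steps=6))
```

With `commit=1` the loop replans at every step, which corresponds to the MPC-style behaviour the abstract describes as lacking commitment. Larger values trade reactivity for fewer planning calls.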
