Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning
Léopold Maytié, Roland Bertin Johannet, Rufin VanRullen
TL;DR
This work introduces GW-Dreamer, a framework that combines Global Workspace (GW) multimodal latent representations with a Dreamer-style world model for reinforcement learning. Each modality is encoded with a pretrained VAE, and the resulting encodings are merged into a shared latent $z$ via adaptive fusion; the agent then dreams within this latent space, which makes learning more sample-efficient. In experiments, GW-Dreamer solves Simple Shapes in approximately $2\times 10^{4}$ environment steps and Robodesk in approximately $2\times 10^{5}$, outperforming multiple PPO and Dreamer baselines. GW-based agents are also robust to missing modalities, and GW pretraining can be amortized across six Robodesk tasks, with strong zero-shot transfer to unseen tasks. The results highlight the potential for GW representations to serve as foundation-model-like encoders for world-model-based RL, with implications for scalable, multimodal decision-making in robotics and cognitive-inspired AI.
Abstract
Humans leverage rich internal models of the world to reason about the future, imagine counterfactuals, and adapt flexibly to new situations. In Reinforcement Learning (RL), world models aim to capture how the environment evolves in response to the agent's actions, facilitating planning and generalization. However, typical world models operate directly on the environment variables (e.g., pixels, physical attributes), which can make their training slow and cumbersome; instead, it may be advantageous to rely on high-level latent dimensions that capture relevant multimodal variables. Global Workspace (GW) Theory offers a cognitive framework for multimodal integration and information broadcasting in the brain, and recent studies have begun to introduce efficient deep learning implementations of GW. Here, we evaluate the capabilities of an RL system combining GW with a world model. We compare our GW-Dreamer with various versions of the standard PPO and the original Dreamer algorithms. We show that performing the dreaming process (i.e., mental simulation) inside the GW latent space allows for training with fewer environment steps. As an additional emergent property, the resulting model (but not its comparison baselines) displays strong robustness to the absence of one of its observation modalities (images or simulation attributes). We conclude that the combination of GW with World Models holds great potential for improving decision-making in RL agents.
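To make the two ideas above concrete (fusing per-modality encodings into a shared latent $z$, then rolling out imagined trajectories inside that latent space), here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the normalized weighted average stands in for the adaptive fusion mechanism, whose exact form is not given in this section, and the names `fuse_latents`, `imagine`, `scores`, `policy`, and `transition` are hypothetical. The per-modality vectors are assumed to have already been produced by the pretrained encoders and projected to a common dimension.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


def fuse_latents(latents: Dict[str, Optional[Vector]],
                 scores: Dict[str, float]) -> Vector:
    """Merge per-modality latent vectors into one shared latent z.

    Assumption: fusion is a normalized weighted average over the modalities
    that are present. A missing modality is passed as None and is skipped,
    so the remaining weights are renormalized to sum to 1.
    """
    present = {name: vec for name, vec in latents.items() if vec is not None}
    if not present:
        raise ValueError("at least one modality must be present")
    total = sum(scores[name] for name in present)
    if total <= 0:
        raise ValueError("fusion scores of present modalities must sum to > 0")
    dim = len(next(iter(present.values())))
    z = [0.0] * dim
    for name, vec in present.items():
        weight = scores[name] / total
        for i, value in enumerate(vec):
            z[i] += weight * value
    return z


def imagine(z0: Vector,
            policy: Callable[[Vector], float],
            transition: Callable[[Vector, float], Vector],
            horizon: int) -> List[Vector]:
    """Roll out an imagined trajectory entirely in latent space.

    No environment step is taken: the (hypothetical) learned transition
    function predicts the next latent from the current latent and action.
    """
    trajectory = [z0]
    z = z0
    for _ in range(horizon):
        action = policy(z)
        z = transition(z, action)
        trajectory.append(z)
    return trajectory


# Toy usage with two modalities named after those in the abstract.
scores = {"image": 1.0, "attributes": 1.0}
z_both = fuse_latents({"image": [1.0, 0.0], "attributes": [0.0, 1.0]}, scores)
z_image_only = fuse_latents({"image": [1.0, 0.0], "attributes": None}, scores)
dream = imagine(z_both,
                policy=lambda z: 1.0,
                transition=lambda z, a: [x + a for x in z],
                horizon=3)
```

In this toy setup the fused latent keeps the same dimensionality whether or not a modality is missing, so downstream components (policy, transition model) can consume it unchanged. This shows only the interface that makes such robustness possible, not a demonstration of the robustness reported in the paper.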
