Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning

Léopold Maytié, Roland Bertin Johannet, Rufin VanRullen

TL;DR

This work introduces GW-Dreamer, a framework that combines Global Workspace (GW) multimodal latent representations with a Dreamer-style world model for reinforcement learning. Each modality is encoded with a pretrained VAE, and the resulting representations are merged into a shared latent $z$ via adaptive fusion, so the agent can dream within this latent space and learn more efficiently. Experiments show substantial sample-efficiency gains: approximately $2\times 10^{4}$ environment steps to solve Simple Shapes and $2\times 10^{5}$ steps for Robodesk, outperforming multiple PPO and Dreamer baselines. Two further findings are the robustness of GW-based agents to missing modalities and the ability to amortize GW pretraining across six Robodesk tasks, with strong zero-shot transfer to unseen tasks. The results highlight the potential for GW representations to serve as foundation-model-like encoders for world-model-based RL, with implications for scalable, multimodal decision-making in robotics and cognitively inspired AI.

Abstract

Humans leverage rich internal models of the world to reason about the future, imagine counterfactuals, and adapt flexibly to new situations. In Reinforcement Learning (RL), world models aim to capture how the environment evolves in response to the agent's actions, facilitating planning and generalization. However, typical world models directly operate on the environment variables (e.g. pixels, physical attributes), which can make their training slow and cumbersome; instead, it may be advantageous to rely on high-level latent dimensions that capture relevant multimodal variables. Global Workspace (GW) Theory offers a cognitive framework for multimodal integration and information broadcasting in the brain, and recent studies have begun to introduce efficient deep learning implementations of GW. Here, we evaluate the capabilities of an RL system combining GW with a world model. We compare our GW-Dreamer with various versions of the standard PPO and the original Dreamer algorithms. We show that performing the dreaming process (i.e., mental simulation) inside the GW latent space allows for training with fewer environment steps. As an additional emergent property, the resulting model (but not its comparison baselines) displays strong robustness to the absence of one of its observation modalities (images or simulation attributes). We conclude that the combination of GW with World Models holds great potential for improving decision-making in RL agents.

Paper Structure

This paper contains 29 sections, 3 equations, 8 figures, 20 tables, 2 algorithms.

Figures (8)

  • Figure 1: Overview of the Global Workspace model for multimodal representation. Raw environment inputs (image pixels, simulation attributes) are encoded into their latent unimodal representations ($u^{v}$ or $u^{attr}$) by pretrained (and frozen) VAEs. These unimodal latent representations are then processed by encoders $e_v$ and $e_{attr}$ (respectively) to produce pre-GW representations ($z^v$ and $z^{attr}$). The final Global Workspace representation $z \in \mathcal{Z}$ is obtained by fusing these pre-GW representations through an element-wise weighted sum (with weights $\alpha_v\geq 0$ and $\alpha_{attr}\geq 0$, $\alpha_v+\alpha_{attr}=1$) followed by a Tanh activation. The unimodal latent vectors can be retrieved from $z$ with a set of decoders $d_v$ and $d_{attr}$. The GW component networks $e_{v}$, $e_{attr}$, $d_{v}$ and $d_{attr}$ are trained by combining a contrastive loss $\mathcal{L}_{cont}$ and a broadcast loss $\mathcal{L}_{broad}$. The former encourages the pre-GW representations to align across modalities; the latter also promotes this objective (see Devillers et al., 2024), and ensures that decoded or "broadcasted" GW representations resemble the original unimodal latent representations, regardless of each modality's initial contribution to the GW representation (as captured by the fusion weights $\alpha_v$ and $\alpha_{attr}$). The GW module can be trained jointly with the rest of the model or pre-trained and subsequently frozen during the learning of the World Model and RL policy (Figure 3), using fixed fusion weights ($\alpha_v = \alpha_{attr} = 0.5$).
  • Figure 2: Illustration of the behaviour of the GW. We start from a fixed attribute vector describing a small red egg-shape (top right), and two images (top left) that are chosen to be incongruent with the attribute vector, in terms of color (right-most image) or both color and size (left-most image). These inputs are encoded into the GW using different fusion weights $\alpha_v$ and $\alpha_{attr}$, indicated in green below each configuration, and subsequently decoded into an image. The resulting images at the bottom illustrate three distinct functional modes of operation. In the translation mode (tr, bottom right), both modalities are encoded, but only attribute information is transmitted through the GW, while visual input is disregarded. The reconstructed images, obtained by decoding the GW latent vector $z$ as an image using $d_v$ and the visual VAE, demonstrate the successful translation of attribute information into the visual domain (both objects are small and red). In the demi-cycle mode (dcy, bottom left), both modalities are encoded, but only the visual information is propagated through the GW. The absence of attribute-induced distortions in the reconstructed images confirms that attribute information was effectively suppressed. In the fusion mode (bottom middle), both modalities are encoded with equal weights, allowing information from both sources to be integrated inside the GW. The decoded images reflect a hybrid representation of visual and attribute features, resulting in an intermediate color and size.
  • Figure 3: (1) World Model training: At each time step, the environment provides observations ($o^v_t$, $o^{attr}_t$), a reward $r_t$, and a termination signal $d_t$. A pretrained and frozen Global Workspace (GW) model, incorporating a Variational Autoencoder (VAE) for each modality, encodes observations into a GW representation $z_t$. The WM is trained on sequences of data collected from the environment using the current AC policy. Given $z_t$ and the action $a_t$ predicted by the policy, the WM (implemented as a GRU: Gated Recurrent Unit) updates its internal state from $h_t$ to $h_{t+1}$. Using this updated state, the WM predicts the next GW representation $z_{t+1}$, the expected reward $r_{t+1}$, and the termination signal $d_{t+1}$ with three separate prediction heads. The loss function $\mathcal{L}_{WM}$ is computed as a weighted sum of the Mean Squared Error (MSE) for $z_{t+1}$ and $r_{t+1}$, and the Binary Cross-Entropy (BCE) loss for predicting $d_{t+1}$. (2) Actor-Critic training: The AC model is trained using "mental simulation". The GW representation $z_t$ derived from observations is provided only at the first time step. For subsequent steps, the WM generates novel states by processing the previously predicted GW representation and the action selected by the AC. The AC loss functions are computed exclusively from the predicted elements within the simulated trajectory, including the generated termination signal $\hat{d}$, reward $\hat{r}$, and actions taken based on the latent state $h$.
  • Figure 4: Illustration of the environments and the tasks used in this study. For Simple Shapes (on the left), the figure presents examples of raw observations, including four example images and one example set of attributes. The agent's goal is to place the shape at the center, pointing upward. The agent can move the shape one pixel at a time in four directions (up, down, left, right) or rotate it clockwise or counterclockwise by an angle of $\frac{\pi}{32}$. For Robodesk (on the right), the observations consist of fixed RGB images along with values representing the proprioception of the robotic arm and information about the objects in the scene. The actions are continuous and allow the robot arm to move, rotate, and open or close its end-effector. The agent's task is to turn on the green light by pressing the green button.
  • Figure 5: Performance (cumulative sum of rewards, or "return") as a function of the number of environment steps (log scale) during training in the Simple Shapes environment (left) and Robodesk (right). A fixed baseline, corresponding to the performance of a fully random policy, was subtracted from the episode returns, so a random policy's performance equals zero. The returns are smoothed using a sliding window of length 10, with the shaded region indicating the standard error of the mean over this window. The return criterion is defined as 75% of the maximum smoothed return in Simple Shapes and 70% in Robodesk; it corresponds (as verified visually) to the performance level at which the task starts to be solved properly.
  • ...and 3 more figures
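
The fusion step described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It assumes the pre-GW vectors $z^v$ and $z^{attr}$ are already available as plain lists of floats; the function name `fuse`, the toy vector values, and the use of pure Python rather than a deep learning library are assumptions made for illustration, not the authors' implementation. Only the formula itself (element-wise weighted sum with $\alpha_v+\alpha_{attr}=1$, followed by Tanh) comes from the caption.

```python
import math


def fuse(z_v, z_attr, alpha_v=0.5):
    """Hypothetical helper: z = tanh(alpha_v * z_v + alpha_attr * z_attr).

    alpha_attr is derived as 1 - alpha_v so the two weights are
    non-negative and sum to one, as stated in the Figure 1 caption.
    """
    if not 0.0 <= alpha_v <= 1.0:
        raise ValueError("alpha_v must lie in [0, 1]")
    if len(z_v) != len(z_attr):
        raise ValueError("pre-GW vectors must have the same length")
    alpha_attr = 1.0 - alpha_v
    return [math.tanh(alpha_v * a + alpha_attr * b) for a, b in zip(z_v, z_attr)]


# Toy pre-GW vectors (made-up values, for illustration only).
z_v = [0.8, -0.2, 1.5]
z_attr = [-0.4, 0.6, 0.5]

# The three modes of operation described in the Figure 2 caption.
z_fusion = fuse(z_v, z_attr, alpha_v=0.5)       # both modalities, equal weights
z_translation = fuse(z_v, z_attr, alpha_v=0.0)  # only attribute information passes
z_demi_cycle = fuse(z_v, z_attr, alpha_v=1.0)   # only visual information passes
```

In this sketch the translation and demi-cycle modes are simply the two extreme settings of the same weighted sum, which is one way to read why a single trained GW could tolerate a missing modality: setting one weight to zero removes that input from $z$ without changing the shape of the representation.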