MuDreamer: Learning Predictive World Models without Reconstruction

Maxime Burchi, Radu Timofte

TL;DR

MuDreamer addresses DreamerV3's dependence on pixel reconstruction by learning a task-focused latent world model that predicts rewards, continuation, value, and actions. Inspired by MuZero, it uses an RSSM-based architecture with an action-prediction head and batch normalization to avoid representation collapse, achieving robust performance under visual distractions and competitive Atari100k results with faster training. The approach highlights the importance of KL balancing and demonstrates strong empirical gains on the DeepMind Visual Control Suite, including natural-background settings. By eliminating the reconstruction loss and decoder requirements, MuDreamer reduces architectural overhead while maintaining or improving control performance in continuous and discrete action spaces.

Abstract

The DreamerV3 agent recently demonstrated state-of-the-art performance in diverse domains, learning powerful world models in latent space using a pixel reconstruction loss. However, while the reconstruction loss is essential to Dreamer's performance, it also necessitates modeling unnecessary information. Consequently, Dreamer sometimes fails to perceive crucial elements which are necessary for task-solving when visual distractions are present in the observation, significantly limiting its potential. In this paper, we present MuDreamer, a robust reinforcement learning agent that builds upon the DreamerV3 algorithm by learning a predictive world model without the need for reconstructing input signals. Rather than relying on pixel reconstruction, hidden representations are instead learned by predicting the environment value function and previously selected actions. Similar to predictive self-supervised methods for images, we find that the use of batch normalization is crucial to prevent learning collapse. We also study the effect of KL balancing between model posterior and prior losses on convergence speed and learning stability. We evaluate MuDreamer on the commonly used DeepMind Visual Control Suite and demonstrate stronger robustness to visual distractions compared to DreamerV3 and other reconstruction-free approaches, replacing the environment background with task-irrelevant real-world videos. Our method also achieves comparable performance on the Atari100k benchmark while benefiting from faster training.
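
The KL balancing mentioned above can be written compactly. What follows is a sketch in the DreamerV2/DreamerV3 formulation that MuDreamer builds on, not necessarily the paper's exact equation. The symbols $h_t$ (a deterministic recurrent state), $q_\phi$ (posterior), $p_\phi$ (prior), $\mathrm{sg}(\cdot)$ (stop-gradient) and the weights $\beta_{\mathrm{dyn}}$, $\beta_{\mathrm{rep}}$ are notation assumed here, and DreamerV3's free-bits clipping is omitted for brevity.

$$
\begin{aligned}
\mathcal{L}_{\mathrm{dyn}} &= \mathrm{KL}\big[\, \mathrm{sg}\big(q_\phi(z_t \mid h_t, x_t)\big) \,\big\|\, p_\phi(\hat{z}_t \mid h_t) \,\big] \\
\mathcal{L}_{\mathrm{rep}} &= \mathrm{KL}\big[\, q_\phi(z_t \mid h_t, x_t) \,\big\|\, \mathrm{sg}\big(p_\phi(\hat{z}_t \mid h_t)\big) \,\big] \\
\mathcal{L}_{\mathrm{KL}} &= \beta_{\mathrm{dyn}}\, \mathcal{L}_{\mathrm{dyn}} + \beta_{\mathrm{rep}}\, \mathcal{L}_{\mathrm{rep}}
\end{aligned}
$$

Both terms have the same value and differ only in where gradients flow: $\mathcal{L}_{\mathrm{dyn}}$ trains the prior toward the posterior, while $\mathcal{L}_{\mathrm{rep}}$ pulls the posterior toward the prior. The ratio $\beta_{\mathrm{dyn}} / \beta_{\mathrm{rep}}$ sets the balance whose effect on convergence speed and stability the abstract refers to.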

Paper Structure (32 sections, 5 equations, 15 figures, 12 tables)


Figures (15)

  • Figure 1: MuDreamer world model training. A sequence of image observations $o_{1:T}$ is sampled from the replay buffer. The sequence is mapped to hidden representations $x_{1:T}$ using a CNN encoder. At each step, the RSSM computes a posterior state $z_{t}$ representing the current observation $o_{t}$ and a prior state $\hat{z}_{t}$ that predicts the posterior without having access to the current observation. Unlike Dreamer, the decoder gradients are not back-propagated to the rest of the model. The hidden representations are learned solely using value, reward, episode continuation and action prediction heads.
  • Figure 2: Reconstruction of MuDreamer model predictions over 64 time steps. We take 5 context frames and generate trajectories of 59 steps into the future using the model's sequential and dynamics networks. Actions are predicted using the policy network given the generated latent states. MuDreamer generates accurate long-term predictions similar to Dreamer without requiring reconstruction loss gradients during training to compress the observation information into the model hidden state.
  • Figure 3: Agent reconstructions of observations with natural backgrounds for the Walker Run, Finger Spin and Quadruped Run tasks. The first row shows the original sequence of observations, the second row shows the DreamerV3 reconstruction and the third row shows the MuDreamer decoder reconstruction. We observe that DreamerV3 reconstructs general details while MuDreamer learns to filter unnecessary information.
  • Figure 4: Ablation mean scores on the Visual Control Suite using 1M environment steps.
  • Figure 5: Trajectories imagined by the world model over 64 time steps using 5 context frames. MuDreamer generates accurate long-term predictions for various tasks without requiring reconstruction loss gradients during training to compress the observation information into the model hidden state. Although the reconstruction gradients are not propagated to the whole network, the decoder successfully reconstructs the input image, meaning that the model hidden state contains all necessary information about the environment.
  • ...and 10 more figures
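
The rollout procedure described in the captions of Figures 2 and 5 (5 context frames, then 59 imagined steps, 64 in total) can be sketched as follows. This is a toy illustration with scalar states, not the authors' implementation: `encode`, `sequence_step`, `prior`, `posterior` and `policy` are hypothetical stand-ins for the CNN encoder, the RSSM networks and the policy network, and their arithmetic is arbitrary. Only the control flow is meant to mirror the captions: the posterior is used while observations are available, after which the model runs open-loop on its own prior and its own predicted actions.

```python
import math
import random

CONTEXT_STEPS = 5   # observed frames before imagination starts (from the captions)
TOTAL_STEPS = 64    # total trajectory length (from the captions)


def encode(observation):
    # Hypothetical stand-in for the CNN encoder: o_t -> x_t.
    return 0.5 * observation


def sequence_step(h, z, action):
    # Hypothetical stand-in for the recurrent sequence network.
    return math.tanh(0.9 * h + 0.1 * z + 0.05 * action)


def prior(h):
    # Hypothetical dynamics network: predicts the latent state without an observation.
    return 0.8 * h


def posterior(h, x):
    # Hypothetical representation network: infers the latent state from h_t and x_t.
    return 0.5 * h + 0.5 * x


def policy(h, z):
    # Hypothetical policy network: picks an action from the latent state only.
    return math.tanh(h + z)


def rollout(observations, actions):
    """Condition on the context frames, then imagine the remaining steps."""
    h, z, prev_action = 0.0, 0.0, 0.0
    trajectory = []
    for t in range(TOTAL_STEPS):
        h = sequence_step(h, z, prev_action)
        if t < CONTEXT_STEPS:
            # Context phase: the observation is available, so use the posterior
            # and replay the recorded action.
            z = posterior(h, encode(observations[t]))
            prev_action = actions[t]
            source = "posterior"
        else:
            # Imagination phase: no observation, so use the prior and let the
            # policy choose the action from the generated latent state.
            z = prior(h)
            prev_action = policy(h, z)
            source = "prior"
        trajectory.append((source, h, z))
    return trajectory


rng = random.Random(0)
context_obs = [rng.uniform(-1.0, 1.0) for _ in range(CONTEXT_STEPS)]
context_actions = [rng.uniform(-1.0, 1.0) for _ in range(CONTEXT_STEPS)]
trajectory = rollout(context_obs, context_actions)
print(len(trajectory), sum(1 for s, _, _ in trajectory if s == "prior"))  # prints "64 59"
```

In the paper's figures, the imagined latent states are additionally passed through a decoder to produce images for visualization; that step is left out here because the decoder plays no role in how the states are generated.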