Next Embedding Prediction Makes World Models Stronger

George Bredis, Nikita Balagansky, Daniil Gavrilov, Ruslan Rakhimov

TL;DR

NE-Dreamer is a decoder-free MBRL agent that uses a temporal transformer to predict next-step encoder embeddings from latent state sequences, directly optimizing temporal predictive alignment in representation space.

Abstract

Capturing temporal dependencies is critical for model-based reinforcement learning (MBRL) in partially observable, high-dimensional domains. We introduce NE-Dreamer, a decoder-free MBRL agent that leverages a temporal transformer to predict next-step encoder embeddings from latent state sequences, directly optimizing temporal predictive alignment in representation space. This approach enables NE-Dreamer to learn coherent, predictive state representations without reconstruction losses or auxiliary supervision. On the DeepMind Control Suite, NE-Dreamer matches or exceeds the performance of DreamerV3 and leading decoder-free agents. On a challenging subset of DMLab tasks involving memory and spatial reasoning, NE-Dreamer achieves substantial gains. These results establish next-embedding prediction with temporal transformers as an effective, scalable framework for MBRL in complex, partially observable environments.
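To make the idea of next-step embedding prediction concrete, here is a minimal sketch of how a shifted alignment objective could be computed. It is an illustration under assumptions, not the authors' implementation: the cosine-distance form of the loss, the function names `cosine_similarity` and `next_embedding_loss`, the `shift` parameter, and the toy 2-D embeddings are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon guards against zero norms


def next_embedding_loss(predictions, embeddings, shift=1):
    """Mean cosine distance between predictions[t] and embeddings[t + shift].

    shift=1 pairs each prediction with the next-step encoder embedding.
    shift=0 pairs it with the same-step embedding instead.
    The cosine-distance form is an assumption made for this sketch.
    """
    if len(predictions) != len(embeddings):
        raise ValueError("predictions and embeddings must have equal length")
    num_pairs = len(predictions) - shift
    if num_pairs <= 0:
        raise ValueError("sequence too short for the requested shift")
    total = 0.0
    for t in range(num_pairs):
        total += 1.0 - cosine_similarity(predictions[t], embeddings[t + shift])
    return total / num_pairs


# Toy sequence of 2-D "encoder embeddings" (hypothetical values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

# A predictor whose output at step t equals the embedding at step t + 1.
# The final prediction has no target when shift=1 and is ignored.
predictions = [[0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [1.0, 0.0]]

loss_next_step = next_embedding_loss(predictions, embeddings, shift=1)
loss_same_step = next_embedding_loss(predictions, embeddings, shift=0)
```

In this toy case the next-step pairing gives a loss near zero, while the same-step pairing gives a loss near one because each prediction is orthogonal to the embedding at its own time step. The only point of the example is that the choice of target index changes what the predictor is rewarded for.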

Paper Structure

This paper contains 28 sections, 16 equations, 7 figures, and 1 table.

Figures (7)

  • Figure 1: DMLab Benchmark Summary. Under matched compute and model capacity (50M environment steps; 5 seeds; 12M parameters), NE-Dreamer outperforms strong decoder-based (DreamerV3) and decoder-free world-model baselines (R2-Dreamer, DreamerPro) on the DMLab Rooms memory/navigation tasks.
  • Figure 2: Method overview. NE-Dreamer keeps Dreamer’s RSSM dynamics and imagination-based actor-critic, but replaces same-step pixel reconstruction with next-embedding prediction using a causal temporal transformer, improving long-horizon performance under partial observability.
  • Figure 3: DMLab Rooms: improved long-horizon memory/navigation. Under matched compute and model capacity (50M environment steps; 5 seeds; 12M parameters), NE-Dreamer outperforms strong decoder-based (DreamerV3) and decoder-free world-model baselines (R2-Dreamer, DreamerPro) on four Rooms tasks. The largest gains occur when success depends on maintaining state over long horizons rather than reacting to short-lived visual cues.
  • Figure 4: Mechanism on DMLab Rooms: predictive sequence modeling is the key. Under matched compute and model capacity (50M environment steps; 5 seeds; mean ± std), removing the causal temporal transformer (w/o transformer) or removing the next-step target shift (w/o shift) substantially reduces performance. Removing the lightweight projector (w/o projector) mainly affects optimization speed/stability, with smaller impact on final returns.
  • Figure 5: Post-hoc decoder reconstruction reveals temporal consistency. Rows show ground-truth observations (GT) and reconstructions from a post-hoc decoder trained on frozen latents. NE-Dreamer preserves task-relevant objects and spatial layout consistently over time (marked with green circles), while same-timestep methods (Dreamer, R2-Dreamer) exhibit temporal inconsistency: task-specific attributes appear transiently and then fade (marked with red circles).
  • ...and 2 more figures
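
The captions for Figures 2 and 4 refer to a causal temporal transformer. As a minimal sketch of what "causal" means in that setting, the snippet below builds a mask in which step t may attend only to steps at or before t, then applies a row-wise softmax under that mask. The helper names `causal_mask` and `causal_attention_weights` and the uniform toy scores are hypothetical; the paper's actual architecture and attention implementation are not described in this summary.

```python
import math


def causal_mask(num_steps):
    """mask[t][s] is True when step t is allowed to attend to step s (s <= t)."""
    return [[s <= t for s in range(num_steps)] for t in range(num_steps)]


def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with future steps masked out."""
    num_steps = len(scores)
    mask = causal_mask(num_steps)
    weights = []
    for t in range(num_steps):
        # Keep only scores for the current and earlier steps.
        allowed = [scores[t][s] for s in range(num_steps) if mask[t][s]]
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in allowed]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        row = [e / total for e in exps] + [0.0] * (num_steps - len(allowed))
        weights.append(row)
    return weights


# Uniform toy scores: each step spreads its attention evenly over its past.
toy_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
toy_weights = causal_attention_weights(toy_scores)
```

With this mask, the representation at step t depends only on the current and earlier latent states, so a prediction of the next embedding cannot peek at the target it is trying to predict.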