StateSpaceDiffuser: Bringing Long Context to Diffusion World Models

Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, Luc Van Gool

TL;DR

StateSpaceDiffuser addresses the challenge of maintaining long-context memory in diffusion-based world models. It integrates a discrete state-space memory (via Mamba) with a diffusion-based generator (DIAMOND) and a fusion module that conditions synthesis on long-range features. The architecture decouples memory from high-fidelity generation, enabling constant-memory, linear-time processing of the history while preserving image quality through diffusion. Two-stage training stabilizes learning and allows swapping memory branches at test time. Extensive experiments on MiniGrid and CSGO show substantial improvements in long-horizon coherence and perceptual recall, supported by a user study. The results demonstrate that combining state-space reasoning with diffusion yields scalable, temporally coherent visual predictions across long sequences, suggesting a promising direction for robust long-context world modeling.
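
The constant-memory, linear-time property mentioned in the summary above can be illustrated with a toy recurrence. The sketch below is not the paper's implementation: it assumes a fixed diagonal linear state-space update, whereas Mamba uses input-dependent (selective) parameters, and the names `ssm_scan`, `fuse`, `decay`, and `gain` are hypothetical. It only shows that the whole history is folded into a fixed-size state, which is then combined with features of the recent frames as conditioning.

```python
from typing import List, Sequence


def ssm_scan(
    inputs: Sequence[Sequence[float]],
    decay: Sequence[float],
    gain: Sequence[float],
) -> List[float]:
    """Toy diagonal state-space recurrence: h_t = decay * h_{t-1} + gain * x_t.

    Assumption: fixed per-channel parameters (a simplification of Mamba).
    One pass over T inputs (linear time); the state never grows (constant memory).
    """
    state = [0.0] * len(decay)
    for x in inputs:
        # Each channel is updated independently from its own previous value.
        state = [a * h + b * xi for a, h, b, xi in zip(decay, state, gain, x)]
    return state


def fuse(recent_features: Sequence[float], memory_state: Sequence[float]) -> List[float]:
    """Hypothetical fusion step: concatenate short-context features with the
    long-context state to form the conditioning vector for a generator."""
    return list(recent_features) + list(memory_state)


# Three time steps of a one-channel input, folded into a single number.
memory = ssm_scan([[1.0], [1.0], [1.0]], decay=[0.5], gain=[1.0])
conditioning = fuse([0.1, 0.2], memory)
```

However long the input sequence is, `memory` keeps the same length as `decay`, which is why the cost of carrying the history does not grow with the number of observed frames.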

Abstract

World models have recently gained prominence for action-conditioned visual prediction in complex environments. However, relying on only a few recent observations causes them to lose long-term context. Consequently, within a few steps, the generated scenes drift from what was previously observed, undermining temporal coherence. This limitation, common in state-of-the-art diffusion-based world models, stems from the lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, in which a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model that represents the entire interaction history. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation environment and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective at delivering both visual detail and long-term memory. Project page: https://insait-institute.github.io/StateSpaceDiffuser/.

Paper Structure

This paper contains 43 sections, 1 equation, 24 figures, 8 tables.

Figures (24)

  • Figure 1: Recalling content long in the past. Given a history of images $I_1, \dots, I_T$ and accompanying actions, we navigate all the way back to the beginning, $I_1$. The task is to generate frames along the way that are consistent with what was seen in the history, given the actions. As an example, we show the predictions of the first frame, $\hat{I}_1$. Can a generative model recall the content of $I_1$ from far back in the sequence? Diffusion models fall short (✗), while our model correctly recalls the content of $I_1$ (✓).
  • Figure 1: MiniGrid Quantitative Evaluation of Long-Context Awareness. Our StateSpaceDiffuser outperforms the baselines.
  • Figure 2: Our Approach. While diffusion models are limited to a short sequence input, our approach enables long-context processing for diffusion models with a state-space representation.
  • Figure 2: Generalization to Longer Context. Our model, trained on context length 50, generalizes to longer sequences (context 100 and 150).
  • Figure 3: Architecture of our StateSpaceDiffuser model. It consists of a state-space model that processes long-context information and a diffusion model that generates a high-fidelity, context-aware next observation conditioned on the state-space features.
  • ...and 19 more figures