StateSpaceDiffuser: Bringing Long Context to Diffusion World Models
Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, Luc Van Gool
TL;DR
StateSpaceDiffuser addresses the challenge of maintaining long-context memory in diffusion-based world models by integrating a discrete state-space memory (via Mamba) with a diffusion-based generator (DIAMOND) and a fusion module that conditions synthesis on long-range features. The architecture decouples memory from high-fidelity generation, enabling constant-memory, linear-time processing of the interaction history while preserving image quality through diffusion. Two-stage training stabilizes learning and allows memory branches to be swapped at test time. Extensive experiments on MiniGrid and CSGO, supported by a user study, show substantial improvements in long-horizon coherence and perceptual recall. The results demonstrate that combining state-space reasoning with diffusion yields scalable, temporally coherent visual predictions across long sequences, suggesting a promising direction for robust long-context world modeling.
Abstract
World models have recently gained prominence for action-conditioned visual prediction in complex environments. However, relying on only a few recent observations causes them to lose long-term context. Consequently, within a few steps, the generated scenes drift from what was previously observed, undermining temporal coherence. This limitation, common in state-of-the-art diffusion-based world models, stems from the lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, in which a diffusion model performs long-context tasks by integrating features from a state-space model that represents the entire interaction history. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation task and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective at delivering both visual detail and long-term memory. Project page: https://insait-institute.github.io/StateSpaceDiffuser/.
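To make the idea of a lasting environment state concrete, the following is a minimal sketch of a linear state-space recurrence that folds an arbitrarily long input history into a fixed-size state. It is an illustrative toy under stated assumptions, not the authors' implementation and not Mamba itself: the class `DiagonalSSM`, the function `summarize_history`, and the fixed `decay` and `in_gain` parameters are all hypothetical, whereas Mamba uses learned, input-dependent parameters. The returned features stand in for the long-range signal that a fusion module could pass to the diffusion generator.

```python
from typing import List, Sequence


class DiagonalSSM:
    """Toy diagonal linear state-space recurrence (hypothetical, not Mamba)."""

    def __init__(self, decay: Sequence[float], in_gain: Sequence[float]) -> None:
        if len(decay) != len(in_gain):
            raise ValueError("decay and in_gain must have the same length")
        self.decay = list(decay)
        self.in_gain = list(in_gain)
        # The state has a fixed size, independent of how many inputs are seen.
        self.state = [0.0] * len(self.decay)

    def step(self, x: float) -> List[float]:
        # h_t = a * h_{t-1} + b * x_t, applied independently per channel.
        self.state = [
            a * h + b * x
            for a, h, b in zip(self.decay, self.state, self.in_gain)
        ]
        return list(self.state)


def summarize_history(
    inputs: Sequence[float],
    decay: Sequence[float],
    in_gain: Sequence[float],
) -> List[float]:
    """Fold the whole input history into one fixed-size feature vector.

    Runs in time linear in len(inputs) and keeps only the current state,
    so memory use does not grow with the sequence length.
    """
    ssm = DiagonalSSM(decay, in_gain)
    features = list(ssm.state)
    for x in inputs:
        features = ssm.step(x)
    return features


if __name__ == "__main__":
    # Channel 0 forgets quickly (decay 0.5); channel 1 never forgets (decay 1.0).
    print(summarize_history([1.0, 0.0, 0.0], decay=[0.5, 1.0], in_gain=[1.0, 1.0]))
```

In this toy, the channel with decay 1.0 still carries the first input after any number of later steps, while the channel with decay 0.5 fades. A diffusion-only model that conditions on a few recent frames has no counterpart to the first channel, which is the gap the state-space branch is meant to fill.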
