Beyond Pixel Histories: World Models with Persistent 3D State

Samuel Garcin, Thomas Walker, Steven McDonagh, Tim Pearce, Hakan Bilen, Tianyu He, Kaixin Wang, Jiang Bian

TL;DR

This work presents PERSIST, a new world-model paradigm that simulates the evolution of a latent 3D scene (environment, camera, and renderer) and demonstrates novel capabilities, including synthesizing diverse 3D environments from a single image.

Abstract

Interactive world models continually generate video by responding to a user's actions, enabling open-ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to downstream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesize new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesizing diverse 3D environments from a single image, as well as enabling fine-grained, geometry-aware control over generated experiences by supporting environment editing and specification directly in 3D space. Project page: https://francelico.github.io/persist.github.io

Paper Structure

This paper contains 24 sections, 6 equations, 14 figures, 12 tables, and 2 algorithms.

Figures (14)

  • Figure 1: Initialized with a single pixel frame, PERSIST evolves in an auto-regressive loop in response to user actions. We first denoise the 3D environment centred on the agent in the form of a latent world-frame. Next, camera parameters are predicted with a feed-forward transformer. We then project the world to the camera plane to form a depth-ordered stack of world latents. Finally, pixel latents are denoised, using pixel-aligned 3D information from the world-latent stack as guidance.
  • Figure 2: PERSIST enables long-horizon spatial memory by modelling the dynamics of a 3D world-frame around the agent. Camera parameters then act as a memory look-up key, fetching relevant features from the world-frame via a geometric projection (here visualized as the coloured voxels).
  • Figure 3: PERSIST can be initialized with a single RGB frame (row 1), or with a single RGB frame and a world-frame (row 2). We visualize the world-frames and videos produced by an auto-regressive rollout of 600 timesteps. Even with a single RGB frame for initialization, PERSIST can generate cohesive and evolving worlds.
  • Figure 4: World frame ${\bm{w}}$ features are projected to screen-space to obtain the depth-ordered stack of features $\tilde{{\bm{w}}}_{2D}$ and linear depth information ${\bm{d}}$.
  • Figure 5: Video frames generated over 600-timestep episodes by PERSIST (Ours), Oasis [decart2024oasis], and WorldMem [xiao2025worldmem].
  • ...and 9 more figures
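
The geometric projection mentioned in the captions for Figures 1, 2, and 4 (world-frame features projected to the camera plane as a depth-ordered stack with linear depth) can be illustrated with a minimal sketch. This is not the paper's implementation: the pinhole camera, the voxel list already expressed in camera coordinates, and the names `project_voxels`, `focal`, and `max_layers` are all assumptions made for illustration. PERSIST itself operates on learned latents with predicted camera parameters.

```python
import math
from collections import defaultdict


def project_voxels(voxels, focal, width, height, max_layers):
    """Project camera-space voxels into a depth-ordered stack per pixel.

    Hypothetical helper, for illustration only.
    voxels: iterable of ((x, y, z), feature), where z is the linear depth
            along the optical axis (camera looks down +z).
    Returns {(u, v): [(depth, feature), ...]} sorted near to far, with at
    most max_layers entries per pixel.
    """
    cx, cy = width / 2.0, height / 2.0  # principal point at the image centre
    hits = defaultdict(list)
    for (x, y, z), feature in voxels:
        if z <= 0:
            continue  # voxel is behind the camera
        # Pinhole projection; floor so negative coordinates do not round up.
        u = math.floor(focal * x / z + cx)
        v = math.floor(focal * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            hits[(u, v)].append((z, feature))
    # Sort each pixel's hits by depth and keep the nearest max_layers.
    return {
        pixel: sorted(layers, key=lambda hit: hit[0])[:max_layers]
        for pixel, layers in hits.items()
    }
```

Calling `project_voxels` with two voxels on the optical axis at depths 2 and 4 yields a single pixel whose stack lists the nearer voxel first. In this sketch the camera parameters determine which voxels land on which pixel, which is the sense in which the camera acts as a look-up key into the world-frame.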