EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory

Jiahao Wang, Luoxin Ye, TaiMing Lu, Junfei Xiao, Jiahan Zhang, Yuxiang Guo, Xijun Liu, Rama Chellappa, Cheng Peng, Alan Yuille, Jieneng Chen

TL;DR

EvoWorld tackles long-horizon, spatially coherent panoramic video generation by introducing an explicit evolving 3D memory that is continually reconstructed from generated frames and projected into future viewpoints to condition diffusion-based synthesis. The framework combines a spherical Plücker-based camera pose encoding with a memory-driven reprojection pipeline, enabling fine-grained view control and improved geometric consistency across looped trajectories. It introduces Spatial360, a large-scale panoramic dataset spanning synthetic, indoor, and real-world environments to benchmark loop closure and spatial coherence. Across extensive experiments, EvoWorld achieves superior 2D perceptual quality and 3D spatial coherence, demonstrates strong loop-consistency, and shows tangible benefits for downstream tasks such as target navigation and 3D reconstruction, marking a significant advance in 3D-grounded, long-horizon panoramic world modeling.
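The spherical Plücker-based pose encoding mentioned above can be pictured with a small sketch. This section does not give the paper's exact formulation, so what follows is only the standard Plücker ray embedding (direction plus moment) applied to the rays of an equirectangular panorama. The function names, the axis convention (y up, z forward), and the channel order (direction first, then moment) are all assumptions made for illustration.

```python
import numpy as np

def equirect_ray_directions(height, width):
    """Unit ray directions for every pixel of an equirectangular panorama.

    Assumed convention: longitude runs from -pi to pi left to right,
    latitude runs from +pi/2 to -pi/2 top to bottom, y is up, z is forward.
    """
    # Pixel centers mapped to spherical angles.
    u = (np.arange(width) + 0.5) / width
    v = (np.arange(height) + 0.5) / height
    lon = (u - 0.5) * 2.0 * np.pi
    lat = (0.5 - v) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both have shape (H, W)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)  # (H, W, 3)

def spherical_plucker_embedding(rotation, center, height, width):
    """Per-pixel Plücker coordinates (d, o x d) in world space, shape (H, W, 6).

    rotation: (3, 3) camera-to-world rotation; center: (3,) camera position.
    """
    dirs_cam = equirect_ray_directions(height, width)
    # Rotate each ray into the world frame (row-vector form of R @ d).
    dirs_world = dirs_cam @ rotation.T
    # The moment o x d encodes where the ray sits in space, not just its direction.
    moment = np.cross(np.broadcast_to(center, dirs_world.shape), dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)
```

A map like this has the same spatial layout as the panorama, so it could be concatenated channel-wise with image or latent features. Whether EvoWorld injects the encoding that way is not stated in this section.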

Abstract

Humans possess a remarkable ability to mentally explore and replay 3D environments they have previously experienced. Inspired by this mental process, we present EvoWorld: a world model that bridges panoramic video generation with evolving 3D memory to enable spatially consistent long-horizon exploration. Given a single panoramic image as input, EvoWorld first generates future video frames by leveraging a video generator with fine-grained view control, then evolves the scene's 3D reconstruction using a feedforward plug-and-play transformer, and finally synthesizes future frames by conditioning on geometric reprojections from this evolving explicit 3D memory. Unlike prior state-of-the-art methods that only synthesize videos, our key insight lies in exploiting this evolving 3D reconstruction as explicit spatial guidance for the video generation process, projecting the reconstructed geometry onto target viewpoints to provide rich spatial cues that significantly enhance both visual realism and geometric consistency. To evaluate long-range exploration capabilities, we introduce the first comprehensive benchmark spanning synthetic outdoor environments, Habitat indoor scenes, and challenging real-world scenarios, with particular emphasis on loop-closure detection and spatial coherence over extended trajectories. Extensive experiments demonstrate that our evolving 3D memory substantially improves visual fidelity and maintains spatial scene coherence compared to existing approaches, representing a significant advance toward long-horizon spatially consistent world modeling.
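The generate, reconstruct, and reproject cycle described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3D memory is treated as a colored point cloud, `reproject_to_panorama` is a simple nearest-point splat into an equirectangular image, and `generate_clip` and `reconstruct` are hypothetical callables standing in for the video generator and the feedforward reconstructor. The loop also simplifies the ordering by reconstructing from the input frame before the first clip.

```python
import numpy as np

def reproject_to_panorama(points, colors, rotation, center, height, width):
    """Splat a colored world-space point cloud into an equirectangular view.

    points, colors: (N, 3); rotation: (3, 3) camera-to-world; center: (3,).
    Returns (image, mask), where mask marks pixels covered by some point.
    """
    # World -> camera frame: p_cam = R^T (p - c), written for row vectors.
    p_cam = (points - center) @ rotation
    dist = np.linalg.norm(p_cam, axis=1)
    keep = dist > 1e-8
    p_cam, dist, colors = p_cam[keep], dist[keep], colors[keep]

    # Camera frame -> spherical angles (assumed: y up, z forward).
    lon = np.arctan2(p_cam[:, 0], p_cam[:, 2])
    lat = np.arcsin(np.clip(p_cam[:, 1] / dist, -1.0, 1.0))

    # Angles -> flat pixel index.
    col = np.floor((lon / (2 * np.pi) + 0.5) * width).astype(int) % width
    row = np.clip(np.floor((0.5 - lat / np.pi) * height).astype(int), 0, height - 1)
    flat = row * width + col

    # Depth test: sort by pixel, then distance, and keep the nearest point per pixel.
    order = np.lexsort((dist, flat))
    first = np.ones(len(order), dtype=bool)
    first[1:] = flat[order][1:] != flat[order][:-1]
    winners = order[first]

    image = np.zeros((height * width, 3), dtype=float)
    mask = np.zeros(height * width, dtype=bool)
    image[flat[winners]] = colors[winners]
    mask[flat[winners]] = True
    return image.reshape(height, width, 3), mask.reshape(height, width)

def explore(first_frame, poses, generate_clip, reconstruct, clip_len=2):
    """Alternate reconstruction and generation along a camera path.

    poses: list of (rotation, center) pairs, one per frame, including the first.
    generate_clip(last_frame, clip_poses, guides) -> list of frames (hypothetical).
    reconstruct(frames, frame_poses) -> (points, colors) (hypothetical).
    """
    height, width = first_frame.shape[:2]
    frames, frame_poses = [first_frame], [poses[0]]
    for start in range(1, len(poses), clip_len):
        clip_poses = poses[start:start + clip_len]
        # Rebuild the 3D memory from everything generated so far.
        points, colors = reconstruct(frames, frame_poses)
        # Render the memory into each upcoming viewpoint as spatial guidance.
        guides = [reproject_to_panorama(points, colors, R, c, height, width)
                  for R, c in clip_poses]
        frames.extend(generate_clip(frames[-1], clip_poses, guides))
        frame_poses.extend(clip_poses)
    return frames
```

The mask returned alongside each reprojected image tells the generator which pixels carry evidence from memory and which are holes it would need to synthesize. When a trajectory loops back to a visited location, most pixels would be covered, which is the situation loop-closure evaluation targets.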

Paper Structure

This paper contains 40 sections, 13 equations, 18 figures, 9 tables, 1 algorithm.

Figures (18)

  • Figure 1: Panoramic world generation with explicit 3D memory. EvoWorld maintains an evolving 3D memory (e.g., via VGGT [wang2025vggt]) based on previously generated frames, and leverages it to guide spatially consistent video generation. The vanilla generator without memory [lu2024genex] exhibits spatial inconsistency.
  • Figure 2: Overview of EvoWorld. Starting from a single panoramic frame and view control, EvoWorld generates spatially consistent videos by iteratively alternating between 3D reconstruction and video generation, where the video generation is conditioned on reprojections from the evolving 3D memory.
  • Figure 3: Qualitative comparison of long-horizon video generation. EvoWorld produces more spatially consistent and geometrically coherent results than GenEx. In this example, EvoWorld accurately follows the conditional path and preserves building structure, while GenEx struggles to maintain layout consistency due to the absence of 3D memory and precise camera control.
  • Figure 4: Qualitative comparison of 3D reconstructions from four ground truth (GT) images alone, GT with GenEx-generated frames, and GT with EvoWorld-generated frames (all shown at the same scale). GT-only reconstructions are incomplete and contain holes; adding GenEx frames improves coverage but introduces noise. EvoWorld yields more complete and cleaner reconstructions, demonstrating better spatial consistency.
  • Figure S1: Qualitative results of panoramic video generation in the Unity environment. Five key frames are shown from a longer video, with intermediate frames omitted for brevity. GenEx exhibits spatial drift and structural artifacts, while our method (EvoWorld) maintains spatial consistency throughout.
  • ...and 13 more figures