EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory
Jiahao Wang, Luoxin Ye, TaiMing Lu, Junfei Xiao, Jiahan Zhang, Yuxiang Guo, Xijun Liu, Rama Chellappa, Cheng Peng, Alan Yuille, Jieneng Chen
TL;DR
EvoWorld tackles long-horizon, spatially coherent panoramic video generation by introducing an explicit evolving 3D memory that is continually reconstructed from generated frames and projected into future viewpoints to condition diffusion-based synthesis. The framework combines a spherical Plücker-based camera pose encoding with a memory-driven reprojection pipeline, enabling fine-grained view control and improved geometric consistency across looped trajectories. The work also introduces Spatial360, a large-scale panoramic dataset spanning synthetic, indoor, and real-world environments, to benchmark loop closure and spatial coherence. Across extensive experiments, EvoWorld achieves superior 2D perceptual quality and 3D spatial coherence, demonstrates strong loop consistency, and shows tangible benefits for downstream tasks such as target navigation and 3D reconstruction, marking a significant advance in 3D-grounded, long-horizon panoramic world modeling.
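To make the idea of a spherical Plücker camera encoding concrete, the following is a minimal sketch, not the authors' implementation. It assumes an equirectangular panorama, a camera whose axes are aligned with the world frame (translation only), and the standard Plücker ray parameterization (o × d, d) for camera origin o and unit ray direction d. The function names `pixel_to_ray` and `plucker_embedding`, and the longitude/latitude convention, are hypothetical choices made for illustration.

```python
import math

def pixel_to_ray(u, v, width, height):
    """Unit ray direction for equirectangular pixel (u, v).

    Assumed convention: longitude spans [-pi, pi) across columns,
    latitude runs from +pi/2 (top row) to -pi/2 (bottom row),
    with y up and z forward.
    """
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def plucker_embedding(origin, width, height):
    """Per-pixel 6-vector (o x d, d) for a panorama centered at `origin`.

    Assumption: no camera rotation, so ray directions are already
    expressed in world coordinates.
    """
    grid = []
    for v in range(height):
        row = []
        for u in range(width):
            d = pixel_to_ray(u, v, width, height)
            moment = cross(origin, d)
            row.append(moment + d)  # tuple concatenation: 6 values per pixel
        grid.append(row)
    return grid
```

The moment term o × d is what distinguishes two cameras that look in the same direction from different positions, which is why an embedding of this kind can carry fine-grained translation information to a generator. A practical version would also apply the camera rotation to each ray and would be vectorized.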
Abstract
Humans possess a remarkable ability to mentally explore and replay 3D environments they have previously experienced. Inspired by this mental process, we present EvoWorld: a world model that bridges panoramic video generation with evolving 3D memory to enable spatially consistent long-horizon exploration. Given a single panoramic image as input, EvoWorld first generates future video frames by leveraging a video generator with fine-grained view control, then evolves the scene's 3D reconstruction using a feedforward plug-and-play transformer, and finally synthesizes future frames by conditioning on geometric reprojections from this evolving explicit 3D memory. Unlike prior state-of-the-art methods that synthesize videos only, our key insight lies in exploiting this evolving 3D reconstruction as explicit spatial guidance for the video generation process, projecting the reconstructed geometry onto target viewpoints to provide rich spatial cues that significantly enhance both visual realism and geometric consistency. To evaluate long-range exploration capabilities, we introduce the first comprehensive benchmark spanning synthetic outdoor environments, Habitat indoor scenes, and challenging real-world scenarios, with particular emphasis on loop-closure detection and spatial coherence over extended trajectories. Extensive experiments demonstrate that our evolving 3D memory substantially improves visual fidelity and maintains spatial scene coherence compared to existing approaches, representing a significant advance toward long-horizon spatially consistent world modeling.
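The geometric reprojection step described above can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's pipeline: it treats the 3D memory as a colored point cloud, assumes a translation-only target camera and an equirectangular image layout, and resolves occlusion with a nearest-point depth buffer. The function name `project_to_panorama` and its conventions are hypothetical.

```python
import math

def project_to_panorama(points, colors, cam_pos, width, height):
    """Splat colored 3D points into an equirectangular view at `cam_pos`.

    Returns (image, depth). Pixels that receive no point hold None in
    `image` and math.inf in `depth`.
    Assumed convention: y up, z forward, longitude measured from +z.
    """
    image = [[None] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    for p, c in zip(points, colors):
        # Express the point relative to the target camera position.
        x, y, z = p[0] - cam_pos[0], p[1] - cam_pos[1], p[2] - cam_pos[2]
        r = math.sqrt(x * x + y * y + z * z)
        if r < 1e-9:
            continue  # point coincides with the camera center
        lon = math.atan2(x, z)
        # Clamp to guard against rounding pushing the ratio past +/-1.
        lat = math.asin(max(-1.0, min(1.0, y / r)))
        u = int((lon + math.pi) / (2.0 * math.pi) * width) % width
        v = min(int((math.pi / 2.0 - lat) / math.pi * height), height - 1)
        # Depth test: keep only the nearest point per pixel.
        if r < depth[v][u]:
            depth[v][u] = r
            image[v][u] = c
    return image, depth

# Two points on the same ray: the nearer one wins the depth test.
pts = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0)]
cols = ["red", "blue"]
img, dep = project_to_panorama(pts, cols, (0.0, 0.0, 0.0), 8, 4)
```

In a system of the kind the abstract describes, a rendering like `img` would serve as a conditioning signal for the video generator at the target viewpoint; the empty pixels are the regions the generator must still synthesize.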
