MagicWorld: Interactive Geometry-driven Video World Exploration

Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, Peng-Tao Jiang

TL;DR

MagicWorld addresses instability under viewpoint changes and historical drift in interactive video world models by introducing action-guided 3D geometry priors (AG3D) and a History Cache Retrieval (HCR) mechanism. It starts from a single image and autoregressively generates video conditioned on user actions, leveraging a 3D point-cloud prior and retrieved past latents to constrain viewpoint transitions and preserve scene semantics. A novel WorldBench evaluation and careful ablations demonstrate improved structural stability and long-term continuity over SOTA methods. The approach enables geometry-aware, long-horizon interactive video synthesis with practical implications for embodied vision, simulation, and policy testing.

Abstract

Recent interactive video world model methods generate scene evolution conditioned on user instructions. Although they achieve impressive results, two key limitations remain. First, they fail to fully exploit the correspondence between instruction-driven scene motion and the underlying 3D geometry, which results in structural instability under viewpoint changes. Second, they easily forget historical information during multi-step interaction, resulting in error accumulation and progressive drift in scene semantics and structure. To address these issues, we propose MagicWorld, an interactive video world model that integrates 3D geometric priors and historical retrieval. MagicWorld starts from a single scene image, employs user actions to drive dynamic scene evolution, and autoregressively synthesizes continuous scenes. We introduce the Action-Guided 3D Geometry Module (AG3D), which constructs a point cloud from the first frame of each interaction and the corresponding action, providing explicit geometric constraints for viewpoint transitions and thereby improving structural consistency. We further propose the History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals, helping the model utilize past scene information and mitigate error accumulation. Experimental results demonstrate that MagicWorld achieves notable improvements in scene stability and continuity across interaction iterations.
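
To make the idea of an action-driven point cloud concrete, here is a minimal sketch of how a first frame could be lifted to 3D and re-projected after a camera move. It is not the authors' AG3D implementation. It assumes a per-pixel depth map is already available (the abstract does not say how depth is obtained), a pinhole intrinsics matrix `K`, and a hypothetical `ACTIONS` table that maps each key to a small camera translation. The names `unproject`, `project`, and `ACTIONS` are illustrative only.

```python
import numpy as np

# Hypothetical key-to-motion table (camera coordinates: x right, z forward).
# The step size and the choice of pure translations are assumptions.
ACTIONS = {
    "W": np.array([0.0, 0.0, 0.1]),   # move forward
    "S": np.array([0.0, 0.0, -0.1]),  # move backward
    "A": np.array([-0.1, 0.0, 0.0]),  # move left
    "D": np.array([0.1, 0.0, 0.0]),   # move right
}

def unproject(depth, K):
    """Lift an (H, W) depth map to an (H*W, 3) point cloud in camera space."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T          # K^-1 applied to each homogeneous pixel
    return rays * depth.reshape(-1, 1)       # scale each ray by its depth

def project(points, K, cam_shift):
    """Project points into a camera translated by `cam_shift`.

    Returns (uv, front): pixel coordinates of the points in front of the
    camera, and the boolean mask selecting those points.
    """
    p = points - cam_shift                   # moving the camera shifts points the other way
    front = p[:, 2] > 1e-6
    uvw = p[front] @ K.T
    return uvw[:, :2] / uvw[:, 2:3], front

# Usage: a flat 4x4 depth map, viewed after one "D" (move right) step.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
cloud = unproject(depth, K)
uv_same, _ = project(cloud, K, np.zeros(3))
uv_right, _ = project(cloud, K, ACTIONS["D"])
uv_fwd, _ = project(cloud, K, ACTIONS["W"])
```

Rendering the re-projected points frame by frame along the action's camera path would give the kind of point-cloud video that can act as a geometric constraint on the generator. A real system would also need to handle rotation, occlusion, and holes.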

Paper Structure

This paper contains 17 sections, 12 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: We introduce MagicWorld, an interactive video world model that supports exploring a dynamic scene created from a single scene image through continuous keyboard actions, while maintaining structural and temporal consistency. MagicWorld generates action-driven point clouds from user inputs (W, A, S, D) to provide geometric constraints for stable viewpoint transitions.
  • Figure 2: Overview of the MagicWorld inference pipeline. Given a single scene image and keyboard actions, MagicWorld interactively generates a dynamic world. At each interaction step, the action-guided 3D geometry module produces action-driven point clouds, which are rendered into a point-cloud video and concatenated with the first frame of the current interaction and noise as inputs to the camera-based video DiT. Meanwhile, the current frame latent retrieves the three most similar historical latents from the cache, which are concatenated as history references. The generated frames are finally decoded into video, and the history cache is updated accordingly.
  • Figure 3: Qualitative comparison of different methods on the same scene image under short-term interactions. We illustrate structural preservation and scene coherence across multiple interaction steps, where our method maintains more stable geometry and consistent visual semantics compared with other approaches. Two frames are selected from each interaction for visualization.
  • Figure 4: Qualitative comparison of different methods on the same scene image under long-term interactions. The results show that our method maintains more stable geometry and coherent scene content over extended interactions compared with other methods. Two frames are selected from each interaction for visualization.
  • Figure 5: Qualitative comparison of different model variants. Red boxes highlight regions with noticeable defects caused by removing specific components. Our full model demonstrates better structural stability and semantic consistency across interactions. Two frames are selected from each interaction for visualization.
  • ...and 5 more figures
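
The retrieval step described in the Figure 2 caption, where the current frame latent retrieves the three most similar historical latents from the cache, can be sketched as a top-k similarity search. This is a sketch under assumptions: the caption does not name the similarity measure, so cosine similarity over flattened latents is used here, and the function name `retrieve_history` and the list-based cache are hypothetical.

```python
import numpy as np

def retrieve_history(query, cache, k=3):
    """Return indices of the k cached latents most similar to `query`.

    Similarity is cosine similarity over flattened latents (an assumption;
    the source only says "most similar"). Indices are ordered from most to
    least similar.
    """
    if len(cache) == 0:
        return []
    q = query.ravel()
    q = q / (np.linalg.norm(q) + 1e-8)
    mat = np.stack([c.ravel() for c in cache])
    mat = mat / (np.linalg.norm(mat, axis=1, keepdims=True) + 1e-8)
    sims = mat @ q                           # one cosine score per cached latent
    k = min(k, len(cache))
    return np.argsort(-sims)[:k].tolist()

# Usage: four toy 2-D "latents" in the cache and one query latent.
cache = [
    np.array([1.0, 0.0]),
    np.array([0.0, 1.0]),
    np.array([1.0, 1.0]),
    np.array([-1.0, 0.0]),
]
query = np.array([1.0, 0.1])
top3 = retrieve_history(query, cache)        # indices of the three best matches
refs = [cache[i] for i in top3]              # would be concatenated as history references
cache.append(query)                          # cache update after the step
```

Per the caption, the retrieved latents are concatenated as history references and the cache is updated after generation. The sketch leaves out how they are injected into the video DiT.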