INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Xinyu Chen, Yifu Wang, Yipeng Chen, Zhenzhou Fan, Zhewen Le, Zhichao Ye, Ziqiang Zhao

Abstract

Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often lack spatial persistence and sufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: an Implicit Spatiotemporal Cache, which aggregates reference and historical observations into a latent world representation to ensure global consistency during long-horizon navigation, and an Explicit Spatial Constraint Module, which enforces geometric structure and translates user interactions into precise, physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.

Paper Structure

This paper contains 22 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: InSpatio-World: Toward a Versatile 4D World Simulator. Top: Our framework enables the synthesis of diverse dynamic scenes from a single video, supporting real-time, high-DoF interactive 4D roaming experiences. Middle: The system is driven by three core capabilities: Free Spatial Roaming along user-defined camera trajectories, Temporal Control over dynamic scene evolution, and the maintenance of Physical Realism. Bottom: These capabilities endow InSpatio-World with the potential to serve as a real-time 4D novel-view rendering engine, promising to support downstream tasks such as Embodied Intelligence and Autonomous Driving.
  • Figure 2: Architecture of the Spatiotemporal Autoregressive Framework and JDMD Pipeline. The framework constructs a spatiotemporal cache using reference information and historical generations, leveraging depth-based warping to establish explicit geometric constraints for consistent autoregressive video generation. The JDMD phase features a multi-task distillation mechanism with shared weights, supervised by a dual-teacher architecture comprising perceptual and motion teachers.
  • Figure 3: Quantitative comparison on WorldScore-Dynamic. Each bubble represents a method, with the vertical axis showing the score of WorldScore-Dynamic and the horizontal axis showing model parameters $\times$ inference steps. InSpatio-World achieves a dynamic score of 68.72 with a significantly lower computational overhead, demonstrating a superior compute-quality trade-off by breaking the zero-sum game between geometric control and generation fidelity.
  • Figure 4: Qualitative comparison on the RE10K-Long dataset. For each of the two scenes, the leftmost image represents the input Source image. For each method, the top row displays the intermediate frame of the generated sequence, while the bottom row showcases the final frame. As generation progresses, baseline methods exhibit varying degrees of failure, such as camera pose drift or structural warping. In contrast, InSpatio-World maintains precise trajectory control and persistent geometric consistency throughout the extended sequence.
  • Figure 5: Qualitative comparison on Camera Controlled Video Rerendering. Each row represents a distinct scene. From left to right: the first frame of the reference video, the warped final frame, and the final frames generated by TrajectoryCrafter, ReCamMaster, NeoVerse, and our method. Compared to existing methods, our approach yields higher structural fidelity to the original scene and delivers significantly better textural details. Simultaneously, it demonstrates superior instruction-following, achieving precise camera trajectories that are nearly identical to the rendered ground truth. The reference frames showcased are sampled from online video platforms and are utilized exclusively for academic demonstration purposes.
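The depth-based warping that Figure 2 describes for the Explicit Spatial Constraint Module follows a standard unproject-transform-reproject pattern: lift each source pixel to 3D using its depth, apply the relative camera motion, and project into the target view. The sketch below illustrates this geometry only; the pinhole model, function name, and matrix conventions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of depth-based warping, assuming a pinhole camera with
# shared intrinsics K and a 4x4 source-to-target extrinsic T_src_to_tgt.
# This is not InSpatio-World's code, only the underlying geometric idea.
import numpy as np

def warp_with_depth(depth, K, T_src_to_tgt):
    """Return target-view pixel coordinates (H, W, 2) for each source pixel."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project to 3D points in the source camera frame: X = d * K^{-1} x.
    pts_src = depth[..., None] * (pix @ np.linalg.inv(K).T)
    # Rigid transform into the target camera frame (homogeneous coordinates).
    pts_h = np.concatenate([pts_src, np.ones((H, W, 1))], axis=-1)
    pts_tgt = (pts_h @ T_src_to_tgt.T)[..., :3]
    # Project with the intrinsics and normalize by depth to get pixels.
    proj = pts_tgt @ K.T
    return proj[..., :2] / np.clip(proj[..., 2:3], 1e-6, None)

# Sanity check: with an identity relative pose, every pixel maps to itself.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
coords = warp_with_depth(np.full((64, 64), 2.0), K, np.eye(4))
```

A generator constrained this way receives, for each step, an explicit hypothesis of where previously observed content should reappear under the requested camera motion, which is the geometric anchor the caption attributes to consistent autoregressive generation.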