InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

InSpatio Team, Xiaoyu Zhang, Weihong Pan, Zhichao Ye, Jialin Liu, Yipeng Chen, Nan Wang, Xiaojun Xiang, Weijian Xie, Yifu Wang, Haoyu Ji, Siji Pan, Zhewen Le, Jing Guo, Xianbin Liu, Donghui Shen, Ziqiang Zhao, Haomin Liu, Guofeng Zhang

Abstract

We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.

Paper Structure (14 sections, 1 equation, 8 figures)

Figures (8)

  • Figure 1: Examples of generated worlds across diverse styles, including photorealistic, science-fiction, game-like, and artistic environments. The joystick interface enables real-time interactive exploration with negligible latency.
  • Figure 2: Overview. In the offline stage, a multi-view-consistent model generates plausible observations that provide 3D anchors and reference appearances. In the online stage, the frame model performs fast real-time inference while updating scene content at keyframes.
  • Figure 3: The pipeline of InSpatio-WorldFM. The left part illustrates the conditional novel-view synthesis pipeline of WorldFM. WorldFM takes a reference image $x_{\text{ref}}$ (implicit scene memory), noisy latents $z_t$, and a point cloud rendering $\hat{x}_{\text{tgt}}$ (explicit 3D anchor) as inputs, which are spatially concatenated along the width dimension. The reference pose $\pi_{\text{ref}}$ and target pose $\pi_{\text{tgt}}$ are also injected as control signals. The frame-based Diffusion Transformer blocks process these conditions and synthesize the target view $I_{\text{tgt}}$ in real time via Distribution Matching Distillation (DMD). The right part shows the detailed architecture of the DiT blocks in WorldFM. Camera geometry control is achieved through the Projection Relative Position Embedding (PRoPE) strategy, which enhances cross-view geometric reasoning. The hybrid spatial memory mechanism combines the point cloud rendering (explicit 3D anchor) and the reference image (implicit memory), which interact solely through self-attention to achieve robust 3D consistency.
  • Figure 4: Qualitative results of teacher model.
  • Figure 5: Qualitative results of InSpatio-WorldFM.
  • ...and 3 more figures
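
The Figure 3 caption states that the reference image, the noisy latents, and the point cloud rendering are spatially concatenated along the width dimension before entering the DiT blocks. The following is a minimal sketch of that layout on toy grids, not the authors' implementation: the ordering of the three inputs, the helper names `concat_width` and `extract_target`, and the step of slicing the target columns back out are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WidthLayout:
    """Column ranges of each input inside the width-concatenated canvas."""
    ref: range      # reference image x_ref (implicit scene memory)
    noisy: range    # noisy latents z_t (the part being denoised)
    render: range   # point cloud rendering (explicit 3D anchor)
    total_width: int


def concat_width(ref, noisy, render):
    """Concatenate three [H][W] grids along the width dimension.

    The (ref, noisy, render) ordering is an assumption for illustration.
    """
    if len({len(ref), len(noisy), len(render)}) != 1:
        raise ValueError("all inputs must share the same height")
    # Row-wise list concatenation places the three inputs side by side.
    canvas = [r + n + p for r, n, p in zip(ref, noisy, render)]
    w_ref, w_noisy, w_render = len(ref[0]), len(noisy[0]), len(render[0])
    layout = WidthLayout(
        ref=range(0, w_ref),
        noisy=range(w_ref, w_ref + w_noisy),
        render=range(w_ref + w_noisy, w_ref + w_noisy + w_render),
        total_width=w_ref + w_noisy + w_render,
    )
    return canvas, layout


def extract_target(canvas, layout):
    """Keep only the columns that correspond to the denoised target view."""
    return [row[layout.noisy.start:layout.noisy.stop] for row in canvas]


# Toy 2x3 grids; the letters mark which input each cell came from.
ref = [["r"] * 3 for _ in range(2)]
noisy = [["z"] * 3 for _ in range(2)]
render = [["p"] * 3 for _ in range(2)]

canvas, layout = concat_width(ref, noisy, render)
target = extract_target(canvas, layout)
```

Because all three inputs share one canvas, a transformer that patchifies it sees tokens from every input in a single sequence, which is consistent with the caption's remark that the inputs interact solely through self-attention.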