DeepVerse: 4D Autoregressive Video Generation as a World Model

Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, Tong He

TL;DR

This work addresses drift and temporal inconsistency in visual world models by introducing DeepVerse, a 4D autoregressive world model that explicitly grounds predictions in geometry through depth and raymap-based representations. It combines a 4D state $\hat{s}_t = (v_t, g_t)$ with a geometry-aware memory module and a 4D autoregressive prior to enable long-horizon, spatially coherent video generation conditioned on actions and textual cues. Through synthetic data with precise geometry supervision and a sliding-window long-duration inference mechanism, the approach achieves improved prediction accuracy, visual realism, and scene rationality, while preserving long-term spatial coherence. The study demonstrates that token-wise fusion of historical 4D information is more robust than channel-wise fusion and highlights the importance of the depth modality for geometric understanding in autoregressive video generation.
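
The difference between the two fusion styles named above is easiest to see at the level of tensor shapes. The following is a minimal sketch, not the paper's architecture: the function names, the `(tokens, channels)` layout, and the toy sizes are all assumptions chosen for illustration.

```python
import numpy as np


def channel_wise_fusion(current: np.ndarray, history: np.ndarray) -> np.ndarray:
    """Stack history onto the channel axis: (N, C) and (N, C) -> (N, 2C).

    Each current token is paired with exactly one history token, so the
    two inputs must contain the same number of tokens.
    """
    if current.shape[0] != history.shape[0]:
        raise ValueError("channel-wise fusion needs matching token counts")
    return np.concatenate([current, history], axis=-1)


def token_wise_fusion(current: np.ndarray, history: np.ndarray) -> np.ndarray:
    """Append history as extra tokens: (N, C) and (M, C) -> (M + N, C).

    The sequence grows instead of the channel width, so the history may
    hold a different number of tokens than the current frame.
    """
    if current.shape[1] != history.shape[1]:
        raise ValueError("token-wise fusion needs matching channel widths")
    return np.concatenate([history, current], axis=0)


# Toy sizes (assumed): 16 current tokens, 32 history tokens, 8 channels.
current = np.zeros((16, 8))
history = np.ones((32, 8))

fused_tokens = token_wise_fusion(current, history)
fused_channels = channel_wise_fusion(current, history[:16])

print(fused_tokens.shape)    # (48, 8)
print(fused_channels.shape)  # (16, 16)
```

With token-wise concatenation the history stays a separate set of tokens that an attention layer can weight individually, whereas channel-wise concatenation fixes a one-to-one pairing between current and history positions before any attention is applied.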

Abstract

World models serve as essential building blocks toward Artificial General Intelligence (AGI), enabling intelligent agents to predict future states and plan actions by simulating complex physical interactions. However, existing interactive models primarily predict visual observations, thereby neglecting crucial hidden states like geometric structures and spatial coherence. This leads to rapid error accumulation and temporal inconsistency. To address these limitations, we introduce DeepVerse, a novel 4D interactive world model explicitly incorporating geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio-temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency, enabling the model to reliably generate extended future sequences and achieve substantial improvements in prediction accuracy, visual realism, and scene rationality. Furthermore, our method provides an effective solution for geometry-aware memory retrieval, effectively preserving long-term spatial consistency. We validate the effectiveness of DeepVerse across diverse scenarios, establishing its capacity for high-fidelity, long-horizon predictions grounded in geometry-aware dynamics.

Paper Structure

This paper contains 23 sections, 6 equations, 12 figures, 2 tables, 2 algorithms.

Figures (12)

  • Figure 1: We introduce DeepVerse, an interactive world model grounded in 4D autoregressive video generation. By establishing a 4D spatiotemporal distribution of the world, DeepVerse enables continuous and coherent 4D future prediction from merely a single input image, effectively modeling both spatial layouts and temporal dynamics simultaneously.
  • Figure 2: Our framework. The inputs to DeepVerse consist of: (1) a sequence of $m$ consecutive 4D observations encompassing current and recent estimated states; (2) spatial conditions retrieved from a global memory pool through the selective mechanism $\psi$; (3) textually specified control signals. The system subsequently generates $k$ temporally coherent 4D future observations, which are automatically archived into the global memory repository for persistent world state tracking.
  • Figure 3: (a) Inferring 3D environments from a single image results in inherent scale ambiguity, a latent variable conditioned on the training data. Generating novel views from images alone, without 3D priors, is significantly more challenging than with explicit 3D structures, often leading to geometrically inconsistent extrapolations and error propagation in autoregressive predictions. (b)(c) Text descriptions of perspective changes can be algorithmically derived from camera pose variations.
  • Figure 4: (a) Two MM-DiT-based architectures were designed to inject historical information. (b) Quantitative evaluation results on VBench (Huang et al., 2023) demonstrate that Model 2 (Token-wise Concatenation) achieves superior performance in nearly all metrics, exhibiting enhanced visual quality and reduced temporal drift issues compared to alternative architectures.
  • Figure 5: Ablation studies on depth modality. (a) Quantitative results demonstrate that the integration of the depth modality yields superior performance in FVD and consistency. (b) Qualitatively, models incorporating depth exhibit enhanced environmental comprehension, achieving improved visual quality and mitigating temporal drift artifacts compared to the baseline.
  • ...and 7 more figures
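
The loop described in the Figure 2 caption (read the last $m$ observations, retrieve spatial conditions through $\psi$, generate $k$ future observations, archive them) can be sketched as plain control flow. This is an illustrative sketch only: `State4D`, `rollout`, `toy_predict`, and `toy_retrieve` are hypothetical names, and the toy stand-ins for the generator and for $\psi$ do not reflect how DeepVerse implements either component.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class State4D:
    """Hypothetical container for one 4D observation (v_t, g_t)."""
    visual: Any
    geometry: Any


def rollout(
    initial: Sequence[State4D],
    predict: Callable[[List[State4D], List[State4D], str], List[State4D]],
    retrieve: Callable[[List[State4D], State4D], List[State4D]],
    actions: Sequence[str],
    m: int = 2,
    k: int = 1,
) -> Tuple[List[State4D], List[State4D]]:
    """Autoregressive rollout with a sliding window and a global memory."""
    window = deque(initial, maxlen=m)   # the m most recent observations
    memory = list(initial)              # global memory pool
    generated: List[State4D] = []
    for action in actions:
        spatial = retrieve(memory, window[-1])          # stand-in for psi
        future = predict(list(window), spatial, action)
        if len(future) != k:
            raise ValueError(f"predict must return exactly {k} states")
        memory.extend(future)     # archive new observations
        window.extend(future)     # slide the window forward
        generated.extend(future)
    return generated, memory


def toy_predict(window, spatial, action):
    # Assumed toy dynamics: bump a counter and record the action.
    return [State4D(visual=window[-1].visual + 1, geometry=action)]


def toy_retrieve(memory, query):
    # Assumed toy retrieval: always return the earliest stored state.
    return memory[:1]


generated, memory = rollout(
    [State4D(visual=0, geometry="start")],
    toy_predict,
    toy_retrieve,
    actions=["forward", "left", "forward"],
)
print([s.visual for s in generated])  # [1, 2, 3]
print(len(memory))                    # 4
```

The bounded `deque` keeps the per-step input size constant no matter how long the rollout runs, while the unbounded memory list is what lets a retrieval function reach observations that have already left the window.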