VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model

Xiangyu Sun; Shijie Wang; Fengyi Zhang; Lin Liu; Caiyan Jia; Ziying Song; Zi Huang; Yadan Luo

Abstract

World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher signal-to-noise ratio, and the second with a two-stage latent flow-forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.
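The contrast between velocity prediction and clean-target (z-prediction) can be illustrated with a minimal NumPy sketch. It assumes a linear interpolation path between Gaussian noise and the clean latent, which is a common flow-matching convention and not necessarily the paper's exact formulation. The function names, the time convention (t=0 is noise, t=1 is data), and the clamping constant `t_max` are hypothetical choices made for illustration only.

```python
import numpy as np

def interpolate(z, eps, t):
    """Noisy latent on a linear path: pure noise at t=0, clean latent at t=1 (assumed convention)."""
    return (1.0 - t) * eps + t * z

def velocity_target(z, eps):
    """Regression target under v-prediction for the linear path above."""
    return z - eps

def velocity_from_z_prediction(z_hat, x_t, t, t_max=0.999):
    """Velocity implied by a clean-target prediction z_hat.

    t is clamped below 1 (hypothetical choice) to avoid division by zero.
    """
    t = min(t, t_max)
    return (z_hat - x_t) / (1.0 - t)

rng = np.random.default_rng(0)
d = 1024                           # feature dimension quoted in the abstract
z = rng.standard_normal((4, d))    # stand-in for clean geometry tokens
eps = rng.standard_normal((4, d))  # Gaussian noise
t = 0.3

x_t = interpolate(z, eps, t)
v = velocity_target(z, eps)

# With a perfect clean-target prediction (z_hat == z), the implied velocity
# matches the v-prediction target exactly.
v_from_z = velocity_from_z_prediction(z, x_t, t)
assert np.allclose(v, v_from_z)
```

Under this convention the two parameterizations describe the same flow, so a network trained to output the clean latent can still drive an ODE sampler. What differs is the regression target the network sees during training.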

Paper Structure (24 sections, 1 theorem, 20 equations, 10 figures, 5 tables)

Introduction
Related Work
Our Approach
Geometry World Modeling via Autoregressive Flow
Chunk Transition in High-Dimensional Geometry Space
Latent Flow Forcing Curriculum
Experiments
Depth Forecasting
Point Map Forecasting
Further Analysis
Conclusion and Future Work
Summary of the Supplementary Material
Theoretical Analysis
Implementation Details
Model Architecture
...and 9 more sections

Key Result

theorem 1

Let $\mathbf{Z}_{1:S}$ be a sequence of uncorrupted geometry chunks drawn from the true data distribution. Let $\mathbf{C} = (\mathbf{c}_1^{\text{mix}}, \dots, \mathbf{c}_S^{\text{mix}})$ be the sequence of auxiliary interpolated conditions. Assuming strict causality in both the forward corruption p

Figures (10)

Figure 1: From video world models to geometry world models. Video world models predict future RGB in VAE latent space, coupling scene dynamics with appearance reconstruction. As a result, decoded predictions can remain geometrically invalid, with broken scene layout and mislocalized actors despite plausible video appearance. VGGT-World instead uses frozen geometry-foundation features as the latent state and models their temporal evolution directly, yielding a VAE-free, lightweight, and geometry-consistent alternative for future 3D forecasting.
Figure 2: Comparison of $v$-prediction and $z$-prediction in frozen VGGT latent space. Left: $z$-prediction consistently achieves substantially higher signal-to-noise ratio than $v$-prediction in the layer-4 latent space. Middle: A target VGGT feature map and its corresponding RGB frame. Right: PCA visualizations of predicted latents during training. $v$-prediction remains noisy even after prolonged training, whereas $z$-prediction progressively recovers structured latent patterns aligned with the target VGGT feature.
Figure 3: Short- and mid-term depth forecasting results on KITTI. "Upper Bound" denotes the depth obtained by feeding the real future images into VGGT. "DINO-Foresight" predicts depth at the scale of Depth Anything [DBLP:conf/nips/YangKH0XFZ24], so its visualizations look different from the others.
Figure 4: Point cloud forecasting on TartanAir. Gen3R produces structurally disorganized geometry on walls and rooftops, whereas our method preserves coherent structure and yields more accurate predictions.
Figure 5: Long-term horizon forecasting on TartanAir.
...and 5 more figures

Theorems & Definitions (2)

theorem 1: Sequential ELBO under Trajectory-Consistent Flow Forcing
proof
