VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model

Xiangyu Sun, Shijie Wang, Fengyi Zhang, Lin Liu, Caiyan Jia, Ziying Song, Zi Huang, Yadan Luo

Abstract

World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher signal-to-noise ratio, and the second with a two-stage latent flow-forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.
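The contrast between velocity prediction and clean-target (z-prediction) can be illustrated with a small sketch. The abstract does not state the interpolation path or time convention, so the code below assumes the common linear path x_t = (1 - t) * noise + t * z with velocity z - noise. All function names are hypothetical and a random vector stands in for a VGGT token; this is not the authors' implementation.

```python
import random

def interpolate(z, noise, t):
    """Assumed linear path: x_t = (1 - t) * noise + t * z."""
    return [(1.0 - t) * n + t * zi for zi, n in zip(z, noise)]

def velocity_target(z, noise):
    """Regression target under v-prediction: v = z - noise."""
    return [zi - n for zi, n in zip(z, noise)]

def clean_target(z):
    """Regression target under z-prediction: the clean latent itself."""
    return list(z)

def velocity_from_clean(z_hat, x_t, t, eps=1e-5):
    """Convert a clean-latent estimate back to a velocity for sampling:
    v = (z_hat - x_t) / (1 - t), clamped near t = 1 to avoid division by zero."""
    denom = max(1.0 - t, eps)
    return [(zh - x) / denom for zh, x in zip(z_hat, x_t)]

def euler_step(x_t, v, dt):
    """One explicit Euler step along the flow."""
    return [x + dt * vi for x, vi in zip(x_t, v)]

# Toy example: a random d=1024 vector stands in for one frozen VGGT token.
rng = random.Random(0)
d = 1024
z = [rng.gauss(0.0, 1.0) for _ in range(d)]
noise = [rng.gauss(0.0, 1.0) for _ in range(d)]
t = 0.3

x_t = interpolate(z, noise, t)
v_true = velocity_target(z, noise)

# With a perfect clean-latent estimate, the recovered velocity matches v_true.
v_rec = velocity_from_clean(clean_target(z), x_t, t)
max_err = max(abs(a - b) for a, b in zip(v_true, v_rec))

# A single Euler step of size (1 - t) along that velocity lands on z.
x_1 = euler_step(x_t, v_rec, 1.0 - t)
end_err = max(abs(a - b) for a, b in zip(x_1, z))
```

Under this assumed convention the two parameterizations describe the same flow at sampling time; they differ only in what the network is asked to regress, which is where the abstract locates the signal-to-noise advantage of z-prediction.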

Paper Structure (24 sections, 1 theorem, 20 equations, 10 figures, 5 tables)

Key Result

Theorem 1

Let $\mathbf{Z}_{1:S}$ be a sequence of uncorrupted geometry chunks drawn from the true data distribution. Let $\mathbf{C} = (\mathbf{c}_1^{\text{mix}}, \dots, \mathbf{c}_S^{\text{mix}})$ be the sequence of auxiliary interpolated conditions. Assuming strict causality in both the forward corruption p

Figures (10)

  • Figure 1: From video world models to geometry world models. Video world models predict future RGB in VAE latent space, coupling scene dynamics with appearance reconstruction. As a result, decoded predictions can remain geometrically invalid, with broken scene layout and mislocalized actors despite plausible video appearance. VGGT-World instead uses frozen geometry-foundation features as the latent state and models their temporal evolution directly, yielding a VAE-free, lightweight, and geometry-consistent alternative for future 3D forecasting.
  • Figure 2: Comparison of $v$-prediction and $z$-prediction in frozen VGGT latent space. Left: $z$-prediction consistently achieves substantially higher signal-to-noise ratio than $v$-prediction in the layer-4 latent space. Middle: A target VGGT feature map and its corresponding RGB frame. Right: PCA visualizations of predicted latents during training. $v$-prediction remains noisy even after prolonged training, whereas $z$-prediction progressively recovers structured latent patterns aligned with the target VGGT feature.
  • Figure 3: Short- and mid-term depth forecasting results on KITTI. "Upper Bound" denotes the depth obtained by feeding the real future images into VGGT. "DINO-Foresight" predicts depth at the scale of Depth Anything (Yang et al., NeurIPS 2024), so its visualizations look different from the others.
  • Figure 4: Point cloud forecasting on TartanAir. Gen3R produces structurally disorganized geometry on walls and rooftops, whereas our method preserves coherent structure and yields more accurate predictions.
  • Figure 5: Long-term horizon forecasting on TartanAir.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Theorem 1: Sequential ELBO under Trajectory-Consistent Flow Forcing
  • Proof