4D-VGGT: A General Foundation Model with SpatioTemporal Awareness for Dynamic Scene Geometry Estimation

Haonan Wang, Hanyu Zhou, Haoyue Liu, Luxin Yan

TL;DR

Dynamic scene geometry estimation requires robust spatiotemporal representations. We present 4D-VGGT, a general spatiotemporal foundation model that combines an adaptive input module, divide-and-conquer feature fusion, and multi-task prediction heads to estimate camera poses, depth, dynamic masks, point maps, and tracking. The model is trained in two stages on diverse datasets with a multi-task loss $L = \lambda_{cam} L_{cam} + \lambda_{depth} L_{depth} + \lambda_{mask} L_{mask} + \lambda_{point} L_{point} + \lambda_{track} L_{track}$, and it achieves strong or state-of-the-art performance on multiple benchmarks. This unified framework targets universality across dynamic-scene settings and offers a paradigm for multi-task spatiotemporal perception with foundation-model capabilities.
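
The multi-task loss in the TL;DR is a weighted sum of five per-task terms. A minimal sketch of that combination follows; the function name `multi_task_loss`, the dictionary keys, and the numeric weights are illustrative assumptions, since this summary does not state the actual values of the $\lambda$ coefficients or how each per-task loss is computed.

```python
# Hypothetical sketch: combine per-task losses as L = sum_k lambda_k * L_k.
# Task names mirror the subscripts in the TL;DR formula; the weights are made up.
TASKS = ("cam", "depth", "mask", "point", "track")


def multi_task_loss(losses, weights):
    """Return the weighted sum of the per-task losses.

    losses:  dict mapping task name -> scalar loss value (L_k)
    weights: dict mapping task name -> scalar weight (lambda_k)
    """
    missing = [k for k in TASKS if k not in losses or k not in weights]
    if missing:
        raise KeyError(f"missing loss or weight for tasks: {missing}")
    return sum(weights[k] * losses[k] for k in TASKS)


# Example with placeholder numbers (not values from the paper).
example_losses = {"cam": 0.2, "depth": 0.4, "mask": 0.6, "point": 0.1, "track": 0.05}
example_weights = {"cam": 1.0, "depth": 0.5, "mask": 0.5, "point": 1.0, "track": 2.0}
total = multi_task_loss(example_losses, example_weights)
```

In a training loop the same weighted sum would typically be taken over framework tensors rather than Python floats, so that gradients flow into every prediction head.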

Abstract

We investigate the challenging task of dynamic scene geometry estimation, which requires representing both spatial and temporal features. Existing methods typically align the two kinds of features in a unified latent space to model scene geometry. However, because spatial and temporal features are heterogeneous, this unified paradigm can produce mismatched representations. In this work, we propose 4D-VGGT, a general foundation model with a divide-and-conquer spatiotemporal representation for dynamic scene geometry. Our model has three components: 1) Multi-setting input. We design an adaptive visual grid that supports input sequences with arbitrary numbers of views and time steps. 2) Multi-level representation. We propose a cross-view global fusion for spatial representation and a cross-time local fusion for temporal representation. 3) Multi-task prediction. We append multiple task-specific heads to the spatiotemporal representations, enabling comprehensive visual geometry estimation for dynamic scenes. Within this unified framework, these components improve the feature discriminability and application universality of our model for dynamic scenes. In addition, we integrate multiple geometry datasets to train our model and conduct extensive experiments to verify the effectiveness of our method across various tasks on multiple dynamic scene geometry benchmarks.
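
To make the "arbitrary numbers of views and time steps" idea concrete, the sketch below arranges per-frame token arrays on a (view, time) grid. It is only an illustration under stated assumptions: the function `to_visual_grid`, the `(view, time)` dictionary keys, the zero padding, and the validity mask are hypothetical choices, and the abstract does not describe how the paper's adaptive visual grid is actually built.

```python
import numpy as np


def to_visual_grid(frame_tokens):
    """Place per-frame tokens on a (view, time) grid.

    frame_tokens: dict mapping (view_id, time_id) -> array of shape (P, D),
                  e.g. patch tokens from an image encoder (assumed layout).
    Returns:
      grid:  array of shape (V, T, P, D); missing cells are zero-filled
      valid: boolean array of shape (V, T) marking which cells hold a frame
    """
    views = sorted({v for v, _ in frame_tokens})
    steps = sorted({t for _, t in frame_tokens})
    num_patches, dim = next(iter(frame_tokens.values())).shape

    grid = np.zeros((len(views), len(steps), num_patches, dim), dtype=np.float32)
    valid = np.zeros((len(views), len(steps)), dtype=bool)
    for (v, t), tokens in frame_tokens.items():
        i, j = views.index(v), steps.index(t)
        grid[i, j] = tokens
        valid[i, j] = True
    return grid, valid


# A monocular video is one view with several time steps ...
mono = {(0, t): np.ones((3, 8), dtype=np.float32) for t in range(4)}
mono_grid, mono_valid = to_visual_grid(mono)      # grid shape (1, 4, 3, 8)

# ... while a static multi-view capture is several views at one time step.
static = {(v, 0): np.ones((3, 8), dtype=np.float32) for v in range(5)}
static_grid, static_valid = to_visual_grid(static)  # grid shape (5, 1, 3, 8)
```

Under this layout, monocular video, static multi-view, and multi-view video inputs differ only in the grid's first two dimensions, which is one way a single model could accept all of them.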

Paper Structure

This paper contains 19 sections, 10 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Illustration of our model paradigm and performance. (a) Our 4D-VGGT accommodates various camera settings and adopts a divide-and-conquer spatiotemporal representation approach for different geometry tasks in dynamic scenes. (b) Performance comparison of visual geometry models. Our 4D-VGGT achieves consistently superior performance across various geometry tasks.
  • Figure 2: Framework of our 4D-VGGT. Our 4D-VGGT consists of three parts: 1) Multi-setting vision input. The input sequence is encoded with DINO and arranged into an adaptive visual grid. 2) Multi-level feature representation. Spatial and temporal features are captured separately in a divide-and-conquer manner. 3) Multi-task geometry prediction. Task-specific prediction heads produce the results for multiple geometry tasks.
  • Figure 3: Feature distribution of different camera settings. We use pretrained DINO to analyze the feature distribution under different camera settings. The similar feature distributions across settings motivate us to design the adaptive visual grid, which accommodates input sequences with arbitrary numbers of views and time steps.
  • Figure 4: Illustration of spatiotemporal representation modules and attention masks. The visual tokens are input into the masked attention module, which performs different attention calculations guided by the attention masks. The spatial masks enable interactions among all tokens from different views within the same time step, while the temporal masks restrict interactions to tokens from the same view within a fixed temporal window.
  • Figure 5: Visual results of depth and dynamic mask estimation.
  • ...and 3 more figures
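
The two attention masks described in the Figure 4 caption can be sketched as boolean matrices over the token sequence. This is a minimal illustration, not the authors' implementation: the function `build_attention_masks`, the view-major token ordering, and the `window` parameter (maximum time-step distance) are assumptions, since the caption only states which tokens may interact.

```python
import numpy as np


def build_attention_masks(num_views, num_steps, tokens_per_frame, window):
    """Build boolean attention masks (True = attention allowed).

    Assumed token order: view-major, then time step, then patch index.
    spatial_mask:  tokens interact if they share a time step (any view).
    temporal_mask: tokens interact if they share a view and their time
                   steps differ by at most `window`.
    """
    frame_len = num_steps * tokens_per_frame
    view_id = np.repeat(np.arange(num_views), frame_len)
    time_id = np.tile(np.repeat(np.arange(num_steps), tokens_per_frame), num_views)

    same_time = time_id[:, None] == time_id[None, :]
    same_view = view_id[:, None] == view_id[None, :]
    within_window = np.abs(time_id[:, None] - time_id[None, :]) <= window

    spatial_mask = same_time
    temporal_mask = same_view & within_window
    return spatial_mask, temporal_mask


def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores, ignoring disallowed positions."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Example: 2 views, 3 time steps, 2 tokens per frame, window of 1 step.
spatial_mask, temporal_mask = build_attention_masks(2, 3, 2, window=1)
attn = masked_softmax(np.zeros(spatial_mask.shape), spatial_mask)
```

Every token is allowed to attend to itself under both masks, so each row has at least one permitted position and the masked softmax stays well defined.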