
VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction

Yu Hu, Chong Cheng, Sicheng Yu, Xiaoyang Guo, Hao Wang

TL;DR

VGGT4D proposes a training-free extension of VGGT to 4D scene reconstruction by mining motion cues from VGGT's global attention. It constructs dynamic masks via cross-layer Gram similarity across a temporal window and sharpens them with a projection-gradient refinement, then applies masks only to shallow layers to decouple dynamic and static regions during inference. The approach yields state-of-the-art results in dynamic segmentation, camera pose estimation, and dense 4D reconstruction across six datasets, and supports long sequences in a single pass. Key contributions include training-free 4D perception, a consistent dynamic-static decoupling pipeline, and strong generalization without external priors or fine-tuning.

Abstract

Reconstructing dynamic 4D scenes is challenging, as it requires robust disentanglement of dynamic objects from the static background. While 3D foundation models like VGGT provide accurate 3D geometry, their performance drops markedly when moving objects dominate. Existing 4D approaches often rely on external priors or heavy post-optimization, or require fine-tuning on 4D datasets. In this paper, we propose VGGT4D, a training-free framework that extends the 3D foundation model VGGT for robust 4D scene reconstruction. Our approach is motivated by the key finding that VGGT's global attention layers already implicitly encode rich, layer-wise dynamic cues. To obtain masks that decouple static and dynamic elements, we mine and amplify global dynamic cues via Gram similarity and aggregate them across a temporal window. To further sharpen mask boundaries, we introduce a refinement strategy driven by the projection gradient. We then integrate these precise masks into VGGT's early-stage inference, effectively mitigating motion interference in both pose estimation and geometric reconstruction. Across six datasets, our method achieves superior performance in dynamic object segmentation, camera pose estimation, and dense reconstruction. It also supports single-pass inference on sequences longer than 500 frames.
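
As a rough illustration of aggregating Gram similarity across a temporal window, the toy NumPy sketch below scores each token by how poorly it matches the same-position token in neighbouring frames. It is not the paper's formulation: the function name `gram_dynamic_saliency`, the `window` parameter, the use of same-position (diagonal) Gram entries, and the "one minus mean cosine similarity" score are all assumptions made for this example.

```python
import numpy as np

def gram_dynamic_saliency(feats: np.ndarray, window: int = 2) -> np.ndarray:
    """Toy dynamic-saliency score (hypothetical helper, not from the paper).

    feats: (T, N, C) token features for T frames, N tokens per frame.
    Returns a (T, N) array in [0, 1]; higher means the token is less
    consistent with its temporal neighbours under this sketch's assumption.
    """
    num_frames, num_tokens, _ = feats.shape
    # L2-normalise so that Gram entries are cosine similarities.
    unit = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    saliency = np.zeros((num_frames, num_tokens))
    for t in range(num_frames):
        lo, hi = max(0, t - window), min(num_frames, t + window + 1)
        neighbours = [s for s in range(lo, hi) if s != t]
        if not neighbours:
            continue  # single-frame input: nothing to compare against
        # Diagonal of the cross-frame Gram matrix unit[t] @ unit[s].T,
        # i.e. similarity of token i in frame t to token i in frame s.
        sims = [np.einsum("nc,nc->n", unit[t], unit[s]) for s in neighbours]
        # Aggregate over the temporal window, then invert into a saliency.
        saliency[t] = 1.0 - np.mean(sims, axis=0)
    return np.clip(saliency, 0.0, 1.0)
```

Thresholding the returned scores would give a coarse binary mask. The sketch assumes a roughly static camera; handling camera motion, combining several layers, and refining boundaries are outside its scope.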

Paper Structure

This paper contains 28 sections, 9 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Easi3R, built upon DUSt3R, is restricted to two-view inputs and derives dynamic region masks by identifying epipolar-inconsistent pixels. In contrast, our proposed VGGT4D reconstructs dynamic scenes from multi-view inputs by extracting global motion cues from VGGT’s attention maps.
  • Figure 2: Overview of VGGT4D. The input image sequence is fed into VGGT. We compute and aggregate its global attention across selected layers and a temporal window to mine dynamic cues. A gradient-aware mask refinement then yields accurate dynamic masks. During inference, we apply the masks to early-stage layers and discard unused layer tokens, producing decoupled dynamic/static point clouds and camera pose estimates.
  • Figure 3: Visualization of VGGT's standard camera-image attention $A^{QK}$. At layer 1, attention strongly focuses on semantic regions (e.g., people). While deeper layers can suppress physically dynamic pixels, we observe this behavior is highly scene-dependent and unreliable for robust segmentation. This limitation motivates our search for a more stable dynamic cue (see the cue-extraction section).
  • Figure 4: Visualization of Gram similarity. We visualize the components $w_\text{shallow}$, $w_\text{middle}$, and $w_\text{deep}$ across different layers, demonstrating their complementary roles in extracting dynamic cues.
  • Figure 5: Qualitative results of dynamic object segmentation. Our method extracts sharp and accurate masks. In contrast, baseline methods suffer from coarse boundaries, missed details, and significant over-segmentation.
  • ...and 4 more figures
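
The early-layer masking mentioned in the Figure 2 caption can be sketched as a single attention step that hides dynamic tokens only in shallow layers. This is a minimal sketch under stated assumptions, not VGGT's or VGGT4D's actual code: the function name `masked_attention`, the choice to mask on the key side, and the `shallow_layers=5` cutoff are hypothetical.

```python
import numpy as np

def masked_attention(q, k, v, dynamic_mask, layer_idx, shallow_layers=5):
    """Single-head attention that hides dynamic keys in shallow layers.

    q, k, v: (N, C) arrays of token features.
    dynamic_mask: (N,) boolean array, True where a token is dynamic.
    layer_idx: index of the current layer; masking applies only when
        layer_idx < shallow_layers (an assumed cutoff).
    Returns (output, weights).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if layer_idx < shallow_layers:
        # A large negative score drives the softmax weight of dynamic keys to 0.
        scores = np.where(dynamic_mask[None, :], -1e9, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

In this sketch, deeper layers (`layer_idx >= shallow_layers`) attend to all tokens as usual, so the mask only influences the early stages of inference.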