DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving

Zhuolin He, Jing Li, Guanghao Li, Xiaolei Chen, Jiacheng Tang, Siyang Zhang, Zhounan Jin, Feipeng Cai, Bin Li, Jian Pu, Jia Cai, Xiangyang Xue

TL;DR

DynamicVGGT introduces two components: a Motion-aware Temporal Attention (MTA) module that learns motion continuity, and a Dynamic 3D Gaussian Splatting Head that explicitly models point motion by predicting Gaussian velocities from learnable motion tokens under scene flow supervision.

Abstract

Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance in static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that learns motion continuity. Furthermore, we design a Dynamic 3D Gaussian Splatting Head that explicitly models point motion by predicting Gaussian velocities using learnable motion tokens under scene flow supervision. It refines dynamic geometry through continuous 3D Gaussian optimization. Extensive experiments on autonomous driving datasets demonstrate that DynamicVGGT significantly outperforms existing methods in reconstruction accuracy, achieving robust feed-forward 4D dynamic scene reconstruction under complex driving scenarios.
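To make the idea of "Gaussian velocities under scene flow supervision" concrete, the following is a minimal sketch and not the paper's implementation. It assumes a constant-velocity motion model over a short interval `dt` and an L1 penalty between the predicted displacement and the ground-truth scene flow; the abstract specifies neither. The function names `advect_centers` and `scene_flow_l1` are hypothetical.

```python
from typing import List, Sequence

Vec3 = Sequence[float]


def advect_centers(centers: Sequence[Vec3], velocities: Sequence[Vec3], dt: float) -> List[List[float]]:
    """Move each Gaussian center along its predicted velocity for dt seconds.

    Assumption: linear (constant-velocity) motion over the interval.
    """
    return [
        [c + v * dt for c, v in zip(center, vel)]
        for center, vel in zip(centers, velocities)
    ]


def scene_flow_l1(velocities: Sequence[Vec3], gt_flow: Sequence[Vec3], dt: float) -> float:
    """Mean L1 error between predicted displacement (v * dt) and scene flow.

    Assumption: scene flow is given as a per-point 3D displacement over dt.
    """
    total = 0.0
    count = 0
    for vel, flow in zip(velocities, gt_flow):
        for v, f in zip(vel, flow):
            total += abs(v * dt - f)
            count += 1
    return total / count if count else 0.0


# Two illustrative Gaussians: one moving sideways, one approaching the camera.
centers = [[0.0, 0.0, 10.0], [2.0, 0.0, 5.0]]
velocities = [[1.0, 0.0, 0.0], [0.0, 0.0, -2.0]]
gt_flow = [[0.5, 0.0, 0.0], [0.0, 0.0, -1.0]]

future_centers = advect_centers(centers, velocities, dt=0.5)
loss = scene_flow_l1(velocities, gt_flow, dt=0.5)
```

In this toy case the predicted velocities reproduce the scene flow exactly, so the loss is zero. In a trained model the velocities would come from the Gaussian head rather than being set by hand.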

Paper Structure (19 sections, 18 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: DynamicVGGT extends static multi-view 3D perception to dynamic 4D reconstruction by enabling 3D Gaussian rendering and adaptively modeling motion across multiple temporal scales without explicit camera extrinsic alignment.
  • Figure 2: Proposed DynamicVGGT training framework. Given a sequence of multi-view images $\{V_1,V_2,V_3\}$, the model first encodes them using a pretrained DINOv2 backbone to extract patch tokens and camera tokens for each view, while motion tokens are initialized as learnable parameters that encode temporal priors. The patch and camera tokens are processed by the Alternating-Attention (AA) blocks to model intra-frame spatial geometry, whereas the Motion-aware Temporal Attention (MTA) blocks operate in parallel to model inter-frame temporal dependencies using the motion tokens. The resulting temporal features ${TA}$ are then fed into a Dynamic 3D Gaussian Head (DGSHead) for dynamic 3DGS reconstruction and a Future Point Head for future point prediction.
  • Figure 3: Dynamic task formulation. We formulate dynamic point maps by designing two complementary tasks that model point-wise motion over time. The Future Point Head learns implicit motion through inter-frame point consistency, while the Dynamic 3D Gaussian Splatting Head provides explicit motion supervision via scene flow to refine dynamic geometry.
  • Figure 4: Depth and Point Maps Comparison. The sparsity of LiDAR point clouds degrades the results, leading to less smooth depth maps and rougher point clouds.
  • Figure 5: Point map reconstruction. DynamicVGGT reconstructs denser, smoother, and more geometrically consistent point maps than VGGT, maintaining temporal coherence even under large viewpoint or scene changes. Zoom in for a better view.
  • ...and 1 more figure
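
The Figure 3 caption describes implicit motion learned through inter-frame point consistency. As a rough illustration of why a shared reference coordinate system helps, the sketch below subtracts a current point map from a future one and flags pixels whose 3D displacement exceeds a threshold. It is an assumption-laden toy rather than the paper's Future Point Head: the function names, the nested-list layout (rows × columns × 3), and the 0.1 threshold are all hypothetical.

```python
import math
from typing import List

PointMap = List[List[List[float]]]  # rows x cols x (x, y, z)


def point_displacements(current_map: PointMap, future_map: PointMap) -> PointMap:
    """Per-pixel 3D displacement between two point maps.

    Both maps are assumed to be expressed in the same reference frame,
    so static points yield (near-)zero displacement.
    """
    return [
        [[f - c for c, f in zip(p_cur, p_fut)] for p_cur, p_fut in zip(row_cur, row_fut)]
        for row_cur, row_fut in zip(current_map, future_map)
    ]


def dynamic_mask(displacements: PointMap, threshold: float = 0.1) -> List[List[bool]]:
    """Mark pixels whose displacement norm exceeds the (hypothetical) threshold."""
    return [
        [math.sqrt(sum(d * d for d in disp)) > threshold for disp in row]
        for row in displacements
    ]


# A 1x2 point map: the first point is static, the second one moves.
current_map = [[[0.0, 0.0, 10.0], [1.0, 0.0, 5.0]]]
future_map = [[[0.0, 0.0, 10.0], [1.3, 0.0, 5.4]]]

mask = dynamic_mask(point_displacements(current_map, future_map))
```

Because both maps share one frame, the displacement of a static point vanishes without any explicit camera-pose compensation, which is the property this toy example is meant to highlight.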