DriveVGGT: Visual Geometry Transformer for Autonomous Driving

Xiaosong Jia, Yanhao Liu, Junqi You, Renqiu Xia, Yu Hong, Junchi Yan

TL;DR

DriveVGGT tackles the mismatch between VGGT-style 4D reconstruction and autonomous driving by introducing TVA for per-camera temporal processing, MCA with normalized relative pose embeddings for cross-camera consistency, and scale-aware prediction heads. The method yields faster, more accurate pose and depth predictions on nuScenes compared with VGGT-based baselines, and includes ablations that confirm the value of relative pose embeddings, windowed cross-camera attention, and a scale head. These components collectively enable reliable, scale-consistent 4D reconstruction across multi-camera AD setups, with practical benefits for real-time perception and downstream tasks. The approach is validated through comprehensive quantitative results and visual analyses, demonstrating improved efficiency and geometric fidelity in autonomous driving contexts.

Abstract

Feed-forward reconstruction has recently gained significant attention, with VGGT being a notable example. However, directly applying VGGT to autonomous driving (AD) systems leads to sub-optimal results because the two settings rely on different priors. In AD systems, several important new priors need to be considered: (i) The overlap between camera views is minimal, as autonomous driving sensor setups are designed to achieve coverage at a low cost. (ii) The camera intrinsics and extrinsics are known, which introduces more constraints on the output and also enables the estimation of absolute scale. (iii) The relative positions of all cameras remain fixed even though the ego vehicle is in motion. To fully integrate these priors into a feed-forward framework, we propose DriveVGGT, a scale-aware 4D reconstruction framework specifically designed for autonomous driving data. Specifically, we propose a Temporal Video Attention (TVA) module to process multi-camera videos independently, which better leverages the spatiotemporal continuity within each single-camera sequence. Then, we propose a Multi-camera Consistency Attention (MCA) module to conduct window attention with normalized relative pose embeddings, aiming to establish consistency relationships across different cameras while restricting each token to attend only to nearby frames. Finally, we extend the standard VGGT heads by adding an absolute scale head and an ego vehicle pose head. Experiments show that DriveVGGT outperforms VGGT, StreamVGGT, and fastVGGT on autonomous driving data, while extensive ablation studies verify the effectiveness of the proposed designs.
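
To make the idea of "restricting each token to attend only to nearby frames" concrete, the following is a minimal sketch of a windowed attention mask over multi-camera tokens. It is an illustration, not the authors' implementation: the function name window_attention_mask, the camera-major token ordering, the use of one token per image, and the symmetric window size are all assumptions introduced here.

```python
from itertools import product


def window_attention_mask(num_cameras, num_frames, window):
    """Build a boolean mask; mask[i][j] is True if token i may attend to token j.

    Assumptions (hypothetical, not taken from the paper):
      * one token per (camera, frame) image,
      * tokens ordered camera-major: index = camera * num_frames + frame,
      * a symmetric temporal window: |frame_i - frame_j| <= window.
    Tokens from different cameras may attend to each other as long as
    their frames fall inside the window.
    """
    index = list(product(range(num_cameras), range(num_frames)))
    return [
        [abs(frame_i - frame_j) <= window for (_, frame_j) in index]
        for (_, frame_i) in index
    ]


# Example: 3 cameras, 4 frames each, window of 1 frame on either side.
mask = window_attention_mask(num_cameras=3, num_frames=4, window=1)

# Token 0 is (camera 0, frame 0): it sees frames 0 and 1 of all 3 cameras.
visible_from_first = sum(mask[0])   # 6 tokens
# Token 1 is (camera 0, frame 1): it sees frames 0, 1, 2 of all 3 cameras.
visible_from_second = sum(mask[1])  # 9 tokens
```

Under these assumptions, the number of tokens each query attends to depends on the window size and the number of cameras rather than on the full sequence length, which is one plausible reason windowed attention would reduce cost on long multi-camera sequences.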

Paper Structure

This paper contains 29 sections, 16 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Model comparison. Our method effectively exploits multi-camera relative poses, running faster and performing better.
  • Figure 2: Overview of DriveVGGT. To inject relative camera poses into the feed-forward reconstruction model, we introduce three sub-modules: 1) the Temporal Video Attention module performs video-level attention among the images of each camera and outputs initial sequential pose and geometry tokens; 2) the Relative Pose Embedding module normalizes real-world relative poses and projects them to the same dimension as the tokens; 3) the Multi-camera Consistency Attention module applies window attention so that images from all cameras interact within a limited sequence length. Finally, DriveVGGT produces outputs for four geometry tasks and completes the scene reconstruction.
  • Figure 3: Illustration of window attention mechanism.
  • Figure 4: Two-step data enhancement for depth ground truth.
  • Figure 5: Qualitative results of DriveVGGT, VGGT, and fastVGGT. We visualize the predicted global points and image poses to compare the models comprehensively.
  • ...and 6 more figures
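
The Figure 2 caption mentions normalizing real-world relative poses before embedding them. The sketch below shows one way such relative poses could be derived from known camera-to-ego extrinsics and then normalized. It is only an illustration under stated assumptions: the rig values are made up, the front camera is arbitrarily chosen as the reference, and dividing translations by the largest translation norm is a hypothetical normalization, since the text here does not specify the scheme used by the authors.

```python
import math


def rot_z(yaw):
    """3x3 rotation about the z axis (yaw in radians)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def relative_pose(pose_a, pose_b):
    """Pose of camera b expressed in the frame of camera a.

    Each pose is (R, t): a camera-to-ego rotation (3x3 nested list) and a
    translation (length-3 list). Because both are expressed in the ego
    frame, the result does not depend on where the ego vehicle is.
    """
    (R_a, t_a), (R_b, t_b) = pose_a, pose_b
    # R_rel = R_a^T R_b
    R_rel = [[sum(R_a[k][i] * R_b[k][j] for k in range(3)) for j in range(3)]
             for i in range(3)]
    # t_rel = R_a^T (t_b - t_a)
    diff = [t_b[k] - t_a[k] for k in range(3)]
    t_rel = [sum(R_a[k][i] * diff[k] for k in range(3)) for i in range(3)]
    return R_rel, t_rel


def normalize_translations(rel_poses):
    """Scale translations so the largest one has unit norm (assumed scheme).

    Returns the normalized poses and the scale factor in metres, which
    could be kept to recover absolute scale later.
    """
    scale = max(math.sqrt(sum(x * x for x in t)) for _, t in rel_poses) or 1.0
    return [(R, [x / scale for x in t]) for R, t in rel_poses], scale


# Hypothetical three-camera rig (camera-to-ego extrinsics, metres).
rig = [
    (rot_z(0.0), [1.5, 0.0, 1.5]),          # front
    (rot_z(math.pi / 2), [1.0, 0.5, 1.5]),  # left
    (rot_z(math.pi), [-1.0, 0.0, 1.5]),     # back
]

# Express every camera relative to the front camera, then normalize.
rel = [relative_pose(rig[0], cam) for cam in rig]
normalized, scale = normalize_translations(rel)
```

In this sketch the scale factor is returned alongside the normalized poses, so the metric information removed by normalization is not lost.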