
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Hanbing Li, Long Chen, Zhi-Xin Yang, Jiwen Lu

Abstract

End-to-end autonomous driving has evolved from the conventional paradigm built on sparse perception toward vision-language-action (VLA) models, which learn language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. Since vehicles operate in a 3D world, we argue that dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and planned trajectories for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further improve efficiency, we propose a sliding-window streaming strategy that reuses historical caches within a fixed-size window to avoid repeated computation. Despite its faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, on both the closed-loop NAVSIM and open-loop nuScenes benchmarks.
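To make the streaming mechanism described above concrete, the following is a minimal sketch (not the authors' implementation) of temporal causal attention over a fixed-size cache of past frame features: the current frame's tokens attend only to the cached history, and the cache evicts its oldest frame once it holds $W$ entries, so per-frame cost stays bounded. All module names, dimensions, and the cache layout are illustrative assumptions.

```python
# Illustrative sketch of sliding-window temporal causal attention with a
# feature cache; names and shapes are assumptions, not the DVGT-2 codebase.
import torch
import torch.nn as nn
from collections import deque


class SlidingWindowTemporalAttention(nn.Module):
    """Current-frame tokens query a cache holding the last W frames' tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 8, window: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cache = deque(maxlen=window)  # oldest frame is evicted automatically

    @torch.no_grad()
    def step(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        """frame_tokens: (B, N, dim) tokens of the current multi-view frame."""
        if self.cache:
            # Keys/values come only from cached history -> O(W) work per frame.
            memory = torch.cat(list(self.cache), dim=1)  # (B, <=W*N, dim)
            fused, _ = self.attn(frame_tokens, memory, memory)
            frame_tokens = frame_tokens + fused  # residual temporal fusion
        # Store the current frame for future steps (causal: no look-ahead).
        self.cache.append(frame_tokens.detach())
        return frame_tokens


if __name__ == "__main__":
    layer = SlidingWindowTemporalAttention(dim=256, num_heads=8, window=4)
    for t in range(6):  # simulate a short streaming sequence
        tokens = torch.randn(1, 6 * 64, 256)  # e.g. 6 views x 64 patch tokens
        out = layer.step(tokens)
        print(t, out.shape, "cached frames:", len(layer.cache))
```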



Figures (8)

  • Figure 1: DVGT-2 is a streaming visual geometry transformer specifically designed for autonomous driving. It takes multi-view images as input and jointly predicts 3D pointmaps, ego poses, and the planned future trajectory in an online manner.
  • Figure 2: Comparison of different paradigms for end-to-end autonomous driving. Conventional end-to-end models rely on sparse perception representations for scene understanding. VLA models predict language descriptions to interpret driving scenarios. Our VGA model reconstructs dense 3D geometry to facilitate safe planning.
  • Figure 3: Comparison of different paradigms for geometry reconstruction. Batch-processing models like DVGT compute pair-wise relations across all frames, incurring an overall $\mathcal{O}(T^2)$ complexity. Full-history streaming models like StreamVGGT extract temporal cues from the entire history, leading to an $\mathcal{O}(T)$ per-frame complexity. In contrast, our sliding-window streaming strategy attends to a fixed-size cache of length $W$, achieving a constant $\mathcal{O}(W)$ per-frame complexity (a toy cost comparison follows this list).
  • Figure 4: Overall architecture of DVGT-2. Our model consists of an image encoder, a geometry transformer with temporal causal attention, and a set of prediction heads to jointly output geometry reconstruction and trajectory planning.
  • Figure 5: Efficient inference of DVGT-2. Given the current-frame multi-view input and the cache of the past $W$ frames, our model performs efficient geometry reconstruction and trajectory planning online without recomputing historical frames.
  • ...and 3 more figures
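As a companion to the Figure 3 caption above, here is a toy cost model (our own illustration, not from the paper) contrasting the attention cost of the three paradigms over a sequence of $T$ frames: all-pairs batch processing, full-history streaming, and sliding-window streaming with window size $W$. Constants and per-frame token counts are abstracted away.

```python
# Hypothetical cost model: number of frame-to-frame attention interactions.
def batch_cost(T: int) -> int:
    # All-pairs attention over T frames: O(T^2) overall.
    return T * T

def full_history_streaming_cost(T: int) -> int:
    # Frame t attends to itself and all t previous frames: O(T) per frame.
    return sum(t + 1 for t in range(T))

def sliding_window_streaming_cost(T: int, W: int) -> int:
    # Frame t attends to itself and at most W cached frames: O(W) per frame.
    return sum(min(t, W) + 1 for t in range(T))

if __name__ == "__main__":
    T, W = 100, 4
    print(batch_cost(T),
          full_history_streaming_cost(T),
          sliding_window_streaming_cost(T, W))
```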