VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving

Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, Long Chen

TL;DR

This work proposes a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving, and introduces a plug-and-play Cross-View 3D Geometric Enabler (CVGE).

Abstract

The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM's 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks such as cross-view risk perception, motion prediction, and trajectory planning. We believe that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.

Paper Structure (17 sections, 11 equations, 10 figures, 10 tables)


Figures (10)

  • Figure 1: Existing relevant paradigms vs. our VGGDrive. (a) The VLA paradigm for trajectory planning. (b) Two existing paradigms for integrating 3D foundation models (VGGT [wang2025vggt]) with VLMs: VGGT-Dist [huang2025mllms] and VGGT-Add [zheng2025learning]. (c) Our VGGDrive, which leverages the VGGT model to profoundly empower the basic VLM with cross-view geometric grounding capabilities, thereby handling diverse autonomous driving tasks.
  • Figure 2: Quantitative comparison of VGGDrive with specific state-of-the-art (SOTA) methods across four autonomous driving benchmarks, covering evaluations of attributes such as cross-view risk perception, motion prediction, and trajectory planning.
  • Figure 3: Overview of VGGDrive. The frozen visual 3D foundation model (VGGT [wang2025vggt]) extracts geometrically consistent 3D features $V^{3d}$ through cross-view analysis, while the base VLM is decomposed into multiple decoder layers. At each decoder layer $i$, the proposed CVGE integrates the shared 3D features $V^{3d}$ with that layer's 2D visual representations $V_{i}^{2d}$ and injects the resulting features $V_{i}^{3d}$ through a hierarchical adaptive mechanism, thereby establishing geometric grounding and enabling deep enhancement of the VLM architecture.
  • Figure 4: Visualization of VGGDrive's performance across various autonomous driving attribute evaluation tasks.
  • Figure S1: Ablation analysis of closed-loop trajectory planning performance on the NAVSIM dataset when cross-view 3D geometric empowerment and adaptive injection are applied to individual decoding layers of the LLM.
  • ...and 5 more figures
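
The Figure 3 caption describes shared 3D features $V^{3d}$ being combined with each decoder layer's 2D visual tokens $V_{i}^{2d}$ and injected adaptively. The text here does not specify how the CVGE does this, so the following is only a minimal sketch of one plausible reading. It assumes single-head cross-attention (2D tokens as queries, 3D features as keys and values) and a learned scalar gate per layer. The names `make_layer`, `inject`, and `inject_all_layers`, the projection matrices, and the tanh gate are all hypothetical and are not taken from the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1], used only to build toy inputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def make_layer(dim, rng, gate=0.0):
    """Hypothetical per-layer parameters: three projections and a scalar gate."""
    return {
        "w_q": random_matrix(dim, dim, rng),
        "w_k": random_matrix(dim, dim, rng),
        "w_v": random_matrix(dim, dim, rng),
        "gate": gate,
    }


def inject(v2d, v3d, layer):
    """Assumed injection step: 2D tokens attend to the shared 3D features."""
    q = matmul(v2d, layer["w_q"])           # queries from 2D tokens, (N2, d)
    k = matmul(v3d, layer["w_k"])           # keys from 3D features, (N3, d)
    v = matmul(v3d, layer["w_v"])           # values from 3D features, (N3, d)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]    # transpose keys to (d, N3)
    attn = [softmax([s * scale for s in row]) for row in matmul(q, k_t)]
    v3d_i = matmul(attn, v)                 # layer-specific geometric features
    alpha = math.tanh(layer["gate"])        # assumed adaptive gate in (-1, 1)
    return [[x + alpha * y for x, y in zip(r2, r3)] for r2, r3 in zip(v2d, v3d_i)]


def inject_all_layers(v2d, v3d, layers, decoder_layer=lambda tokens: tokens):
    """Apply the injection before each decoder layer (identity placeholder here)."""
    tokens = v2d
    for layer in layers:
        tokens = inject(tokens, v3d, layer)
        tokens = decoder_layer(tokens)
    return tokens
```

In this sketch, a gate initialised to zero gives `tanh(0) = 0`, so the visual tokens pass through unchanged and the base VLM's behaviour is preserved at the start of training. That zero-initialised gate is a design choice assumed for illustration, not one confirmed by the source.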