VGGT-MPR: VGGT-Enhanced Multimodal Place Recognition in Autonomous Driving Environments

Jingyi Xu, Zhangshuo Qi, Zhongmiao Yan, Xuyu Gao, Qianyun Jiao, Songpengcheng Xia, Xieyuanli Chen, Ling Pei

TL;DR

VGGT-MPR is a multimodal place recognition framework that adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking, and introduces a training-free re-ranking mechanism that exploits VGGT's cross-view keypoint-tracking capability.

Abstract

In autonomous driving, robust place recognition is critical for global localization and loop closure detection. While inter-modality fusion of camera and LiDAR data in multimodal place recognition (MPR) has shown promise in overcoming the limitations of unimodal counterparts, existing MPR methods largely rely on hand-crafted fusion strategies and heavily parameterized backbones that require costly retraining. To address this, we propose VGGT-MPR, a multimodal place recognition framework that adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. In the global retrieval stage, VGGT extracts geometrically rich visual embeddings through prior depth-aware and point-map supervision, and densifies sparse LiDAR point clouds with predicted depth maps to improve structural representation. This enhances the discriminative ability of the fused multimodal features and produces global descriptors for fast retrieval. Beyond global retrieval, we design a training-free re-ranking mechanism that exploits VGGT's cross-view keypoint-tracking capability. By combining mask-guided keypoint extraction with confidence-aware correspondence scoring, the re-ranking mechanism refines retrieval results without additional parameter optimization. Extensive experiments on large-scale autonomous driving benchmarks and our self-collected data demonstrate that VGGT-MPR achieves state-of-the-art performance and is robust to severe environmental changes, viewpoint shifts, and occlusions. Our code and data will be made publicly available.
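
To make the "global descriptors for fast retrieval" step concrete, the following is a minimal sketch of a top-k descriptor lookup. It is not the authors' implementation: the function names `l2_normalize` and `retrieve_top_k`, the use of cosine similarity, and the toy three-dimensional descriptors are all assumptions made for illustration, and the abstract does not specify the distance metric or descriptor size.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def retrieve_top_k(query, database, k):
    """Return indices of the k database descriptors most similar to the query.

    Cosine similarity is an assumed metric chosen for this sketch.
    Ties are broken by the lower database index.
    """
    q = l2_normalize(query)
    scored = []
    for idx, desc in enumerate(database):
        d = l2_normalize(desc)
        similarity = sum(a * b for a, b in zip(q, d))
        scored.append((similarity, idx))
    # Highest similarity first; lower index first on ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:k]]


# Hypothetical toy database of four places with 3-D global descriptors.
database = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]
top_k = retrieve_top_k(query, database, k=2)
print(top_k)
```

The returned candidate list is what a re-ranking stage would then reorder. A real system would likely replace the linear scan with an approximate nearest-neighbor index once the database grows large.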

Paper Structure (22 sections, 2 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: From VGGT to VGGT-MPR. VGGT is reinterpreted as a unified geometry-centric foundation to address modality-specific limitations in multimodal place recognition. It enhances visual representations with implicit structural awareness, densifies sparse LiDAR observations via depth estimation for global retrieval, and provides reliable cross-view point tracking for training-free re-ranking.
  • Figure 2: Overview of our proposed VGGT-MPR. Capitalizing on the spatial perception capabilities of VGGT, the pipeline integrates a global retrieval module (GRM) and a re-ranking mechanism (RRM). The GRM fuses multimodal inputs (camera images and LiDAR point clouds) to generate global descriptors for database indexing and retrieval. Subsequently, the RRM refines the retrieved candidates to improve place recognition accuracy.
  • Figure 3: Our proposed re-ranking mechanism in VGGT-MPR. Given a query image and a corresponding candidate from the top-$k$ matches (e.g., candidates A or B), we perform mask-guided keypoint extraction (a) and confidence-aware correspondence scoring (b) to calculate the score for the input candidate. The top-$k$ candidates are ultimately re-ranked based on their respective correspondence scores to enhance recognition accuracy.
  • Figure 4: The unmanned ground vehicle (UGV) and collection trajectory for our self-collected data.
  • Figure 5: Visualization of retrieval results. The first column shows the current query image, the second and third columns show the top-$1$ places retrieved by LCPR and our method, respectively. The value in the bottom-right corner of each retrieved image indicates its distance to the query position.
  • ...and 2 more figures
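
The re-ranking flow in the Figure 3 caption (mask-guided keypoint extraction, then confidence-aware correspondence scoring, then reordering the top-$k$ candidates) can be sketched roughly as follows. This is a hedged illustration only. The helper names, the binary mask format, the 0.5 confidence threshold, and the choice of summing confidences as the score are assumptions rather than details from the paper, and the per-keypoint tracking confidences are supplied as plain numbers instead of being produced by VGGT.

```python
def extract_masked_keypoints(keypoints, mask):
    """Keep only (row, col) keypoints that fall on truthy mask pixels."""
    return [(r, c) for r, c in keypoints if mask[r][c]]


def correspondence_score(confidences, threshold=0.5):
    """Sum the tracking confidences that clear the threshold.

    Both the threshold and the sum are assumed scoring choices.
    """
    return sum(c for c in confidences if c >= threshold)


def rerank(candidate_ids, track_confidences, threshold=0.5):
    """Reorder candidates by descending correspondence score.

    sorted() is stable, so candidates with equal scores keep their
    original retrieval order.
    """
    return sorted(
        candidate_ids,
        key=lambda cid: -correspondence_score(track_confidences[cid], threshold),
    )


# Hypothetical 2x2 mask and query keypoints.
mask = [[0, 1], [1, 1]]
kept = extract_masked_keypoints([(0, 0), (0, 1), (1, 0)], mask)

# Hypothetical per-candidate confidences for the tracked keypoints.
track_confidences = {
    "A": [0.9, 0.2, 0.4],
    "B": [0.8, 0.7, 0.6],
    "C": [0.3, 0.1],
}
reranked = rerank(["A", "B", "C"], track_confidences)
print(kept, reranked)
```

Because the scoring uses only tracker outputs and fixed rules, nothing in this step is trained, which mirrors the training-free property the paper claims for its re-ranking mechanism.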