VGGT-MPR: VGGT-Enhanced Multimodal Place Recognition in Autonomous Driving Environments

Jingyi Xu; Zhangshuo Qi; Zhongmiao Yan; Xuyu Gao; Qianyun Jiao; Songpengcheng Xia; Xieyuanli Chen; Ling Pei

TL;DR

VGGT-MPR is a multimodal place recognition framework that adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. It includes a training-free re-ranking mechanism that exploits VGGT's cross-view keypoint-tracking capability.

Abstract

In autonomous driving, robust place recognition is critical for global localization and loop closure detection. While fusing camera and LiDAR data in multimodal place recognition (MPR) has shown promise in overcoming the limitations of unimodal counterparts, existing MPR methods largely rely on hand-crafted fusion strategies and heavily parameterized backbones that require costly retraining. To address this, we propose VGGT-MPR, a multimodal place recognition framework that adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. In the global retrieval stage, VGGT extracts geometrically rich visual embeddings through prior depth-aware and point map supervision, and densifies sparse LiDAR point clouds with predicted depth maps to improve structural representation. This enhances the discriminative ability of the fused multimodal features and produces global descriptors for fast retrieval. Beyond global retrieval, we design a training-free re-ranking mechanism that exploits VGGT's cross-view keypoint-tracking capability. By combining mask-guided keypoint extraction with confidence-aware correspondence scoring, the re-ranking mechanism refines retrieval results without additional parameter optimization. Extensive experiments on large-scale autonomous driving benchmarks and our self-collected data demonstrate that VGGT-MPR achieves state-of-the-art performance, exhibiting strong robustness to severe environmental changes, viewpoint shifts, and occlusions. Our code and data will be made publicly available.
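The two-stage flow described in the abstract (descriptor-based retrieval of top-k candidates, followed by training-free re-ranking with confidence-aware correspondence scores) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the L2 distance metric, the confidence threshold, and the scoring rule (summing the confidences of masked, sufficiently confident tracked keypoints) are all assumptions chosen for illustration. The per-keypoint confidences and the keypoint mask are taken as given inputs; in VGGT-MPR they would come from VGGT's tracking output and the mask-guided keypoint extraction step.

```python
import numpy as np


def retrieve_top_k(query_desc, db_descs, k):
    """Return indices of the k database descriptors nearest to the query.

    Assumption: descriptors are compared with plain L2 distance.
    """
    query_desc = np.asarray(query_desc, dtype=float)
    db_descs = np.asarray(db_descs, dtype=float)
    dists = np.linalg.norm(db_descs - query_desc, axis=1)
    return np.argsort(dists, kind="stable")[:k]


def correspondence_score(confidences, mask, conf_threshold=0.5):
    """Hypothetical confidence-aware score for one query/candidate pair.

    confidences: per-keypoint tracking confidence (assumed to be given).
    mask: boolean array marking keypoints kept by mask-guided extraction.
    Only masked keypoints whose confidence reaches the threshold contribute.
    """
    confidences = np.asarray(confidences, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    keep = mask & (confidences >= conf_threshold)
    return float(confidences[keep].sum())


def rerank(candidates, scores):
    """Reorder candidates by descending score; ties keep the retrieval order."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return [int(candidates[i]) for i in order]
```

Because the re-ranking step only sorts by a score computed from tracker outputs, it introduces no learnable parameters, which matches the training-free property stated in the abstract. The exact scoring formula used in the paper may differ from the one assumed here.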

Paper Structure (22 sections, 2 equations, 7 figures, 8 tables)

Introduction
Related Work
Multimodal Place Recognition
Foundation Model-Based Place Recognition
Re-Ranking for Place Recognition
Methodology
VGGT-MPR Architecture
Global Retrieval Module
Re-Ranking Mechanism
Experiments and Analyses
Experimental Setups
Datasets
Implementation Details
Comparison With SOTA Methods
Evaluation on Public Datasets
...and 7 more sections

Figures (7)

Figure 1: From VGGT to VGGT-MPR. VGGT is reinterpreted as a unified geometry-centric foundation to address modality-specific limitations in multimodal place recognition. It enhances visual representations with implicit structural awareness, densifies sparse LiDAR observations via depth estimation for global retrieval, and provides reliable cross-view point tracking for training-free re-ranking.
Figure 2: Overview of our proposed VGGT-MPR. Capitalizing on the spatial perception capabilities of VGGT, the pipeline integrates a global retrieval module (GRM) and a re-ranking mechanism (RRM). The GRM fuses multimodal inputs (camera images and LiDAR point clouds) to generate global descriptors for database indexing and retrieval. Subsequently, the RRM refines the retrieved candidates to improve place recognition accuracy.
Figure 3: Our proposed re-ranking mechanism in VGGT-MPR. Given a query image and a corresponding candidate from the top-$k$ matches (e.g., candidates A or B), we perform mask-guided keypoint extraction (a) and confidence-aware correspondence scoring (b) to calculate the score for the input candidate. The top-$k$ candidates are ultimately re-ranked based on their respective correspondence scores to enhance recognition accuracy.
Figure 4: The unmanned ground vehicle (UGV) and collection trajectory for our self-collected data.
Figure 5: Visualization of retrieval results. The first column shows the current query image; the second and third columns show the top-$1$ places retrieved by LCPR and our method, respectively. The value in the bottom-right corner of each retrieved image indicates its distance to the query position.
...and 2 more figures
