Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition

Feng Lu, Lijun Zhang, Xiangyuan Lan, Shuting Dong, Yaowei Wang, Chun Yuan

TL;DR

This work tackles visual place recognition (VPR) by bridging the gap between foundation-model pre-training and VPR objectives. It introduces SelaVPR, a parameter-efficient hybrid global-local adaptation framework built on a frozen ViT backbone, with lightweight adapters for global features and an up-sampling module for dense local features. Training combines a mutual nearest neighbor local feature loss with a global triplet loss, enabling effective re-ranking without costly spatial verification. Empirically, SelaVPR achieves state-of-the-art results across multiple VPR benchmarks, with notable improvements in challenging conditions. Because it eliminates RANSAC-based spatial verification, it needs only about 3% of the retrieval runtime of two-stage methods that rely on it, which makes it suitable for large-scale real-world deployment.

Abstract

Recent studies show that vision models pre-trained on generic visual learning tasks with large-scale data can provide useful feature representations for a wide range of visual perception problems. However, few attempts have been made to exploit pre-trained foundation models in visual place recognition (VPR). Because the training objectives and data of model pre-training differ inherently from those of VPR, how to bridge the gap and fully unleash the capability of pre-trained models for VPR remains a key open issue. To this end, we propose a novel method to realize seamless adaptation of pre-trained models for VPR. Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method that achieves both global and local adaptation efficiently, in which only lightweight adapters are tuned and the pre-trained model is left unchanged. In addition, to guide effective adaptation, we propose a mutual nearest neighbor local feature loss, which ensures that proper dense local features are produced for local matching and avoids time-consuming spatial verification in re-ranking. Experimental results show that our method outperforms state-of-the-art methods with less training data and training time, and uses only about 3% of the retrieval runtime of two-stage VPR methods with RANSAC-based spatial verification. It ranks 1st on the MSLS challenge leaderboard (at the time of submission). The code is released at https://github.com/Lu-Feng/SelaVPR.
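
To make the re-ranking idea concrete, the following is a minimal sketch of mutual nearest neighbor matching between the dense local features of a query and those of each candidate. It is not the authors' implementation. The function names (`cosine`, `mutual_nn_matches`, `rerank`), the use of cosine similarity, and the choice to score a candidate by its number of mutual matches are assumptions made for illustration; the released code should be consulted for the actual procedure.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mutual_nn_matches(query_feats, cand_feats):
    """Return (i, j) pairs where query feature i and candidate feature j
    are each other's nearest neighbor under cosine similarity."""
    if not query_feats or not cand_feats:
        return []
    sim = [[cosine(q, c) for c in cand_feats] for q in query_feats]
    # Nearest candidate feature for every query feature.
    nn_of_query = [max(range(len(cand_feats)), key=lambda j: row[j]) for row in sim]
    # Nearest query feature for every candidate feature.
    nn_of_cand = [
        max(range(len(query_feats)), key=lambda i: sim[i][j])
        for j in range(len(cand_feats))
    ]
    return [(i, j) for i, j in enumerate(nn_of_query) if nn_of_cand[j] == i]


def rerank(query_feats, candidates):
    """Order candidate names by mutual-match count, highest first.

    `candidates` maps a (hypothetical) image name to its list of local
    features. Scoring by match count is an assumption of this sketch.
    """
    scores = {
        name: len(mutual_nn_matches(query_feats, feats))
        for name, feats in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Only nearest-neighbor lookups are involved, so no geometric model is fitted per candidate, which is the step that RANSAC-based spatial verification spends its time on. A practical version would compute the similarity matrix with a tensor library rather than nested Python lists.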

Paper Structure

This paper contains 26 sections, 11 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Attention map visualizations of the pre-trained foundation model (DINOv2) and our model. The pre-trained model attends to some regions (e.g. dynamic riders) that are useless for identifying places. Our method focuses on discriminative regions (buildings and trees).
  • Figure 2: Illustration of the global adaptation. We add a serial adapter (b) after the MHA layer and a parallel adapter (c) in parallel to the MLP layer in each standard transformer block (a) to achieve global adaptation.
  • Figure 3: Illustration of the local adaptation and our two-stage VPR pipeline. The globally adapted ViT backbone is applied to extract the feature map. We first use GeM to pool the feature map into the global feature for candidate retrieval. The local adaptation module after the backbone is implemented with up-conv layers, which upsample the feature map to yield dense local features. Then we cross-match the local features between the query image and each candidate for re-ranking.
  • Figure 4: Qualitative results. In these challenging examples (containing condition changes, viewpoint changes, dynamic objects, etc.), the proposed SelaVPR successfully returns the right database images, while all other methods produce incorrect results.
  • Figure 5: R@1-runtime comparison.
  • ...and 6 more figures
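
The captions of Figures 2 and 3 describe two mechanisms that a short sketch can illustrate: where the adapters sit inside a transformer block, and how GeM pools a feature map into one global descriptor. The code below is an illustrative sketch under stated assumptions, not the authors' implementation. The names `adapted_block` and `gem_pool`, the residual wiring, the `scale` factor on the parallel adapter, and the defaults `p=3.0` and `eps=1e-6` are all assumptions; the captions only state that a serial adapter follows the MHA layer and a parallel adapter runs alongside the MLP layer.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over N local descriptors.

    `feature_map` is a list of N descriptors, each a list of C floats.
    Returns a C-dimensional global descriptor: per channel, the p-th root
    of the mean of the p-th powers. Values are clamped to `eps` so the
    power is well defined; p=1 gives average pooling and large p
    approaches max pooling.
    """
    n = len(feature_map)
    channels = len(feature_map[0])
    pooled = []
    for c in range(channels):
        mean_pow = sum(max(token[c], eps) ** p for token in feature_map) / n
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


def adapted_block(x, mha, mlp, serial_adapter, parallel_adapter,
                  norm=lambda t: t, scale=1.0):
    """One transformer block with two adapters (assumed wiring).

    `mha` and `mlp` stand in for the frozen pre-trained layers, and the
    two adapters for the small trainable modules. All are passed in as
    callables, so the function works on plain floats or on any array type
    that supports `+` and `*`.
    """
    # Serial adapter: applied to the MHA output before the residual add.
    h = x + serial_adapter(mha(norm(x)))
    # Parallel adapter: a scaled branch alongside the frozen MLP.
    n = norm(h)
    return h + mlp(n) + scale * parallel_adapter(n)
```

With an identity serial adapter and a parallel adapter that returns zero, `adapted_block` reduces to a standard pre-norm transformer block, which is one way to see that the adapters add capacity without modifying the frozen weights.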