SelaVPR++: Towards Seamless Adaptation of Foundation Models for Efficient Place Recognition

Feng Lu, Tong Jin, Xiangyuan Lan, Lijun Zhang, Yunpeng Liu, Yaowei Wang, Chun Yuan

TL;DR

SelaVPR++ tackles visual place recognition by enabling seamless, efficient adaptation of foundation models through memory-efficient MultiConv adapters that refine frozen backbone features. It introduces a two-stage VPR paradigm using compact binary descriptors for initial retrieval and robust floating-point features for re-ranking, eliminating costly local feature matching while preserving accuracy. A similarity-constrained deep hashing loss with straight-through estimation enables end-to-end training of binary descriptors, and a unified training dataset protocol merges GSV-Cities, SF-XL, Pitts30k, and MSLS for robust supervision. Experiments show SelaVPR++ outperforms prior methods in recognition accuracy and dramatically reduces training and retrieval time, including first-place results on MSLS, while maintaining lower memory footprints. This work advances practical large-scale VPR by combining parameter-efficient adaptation, efficient hashing-based retrieval, and unified multi-dataset training.

Abstract

Recent studies show that visual place recognition (VPR) methods built on pre-trained visual foundation models can achieve promising performance. In our previous work, we proposed a novel method to realize seamless adaptation of foundation models to VPR (SelaVPR). Through a parameter-efficient adaptation approach, this method produces both global and local features that focus on discriminative landmarks, supporting two-stage VPR. Although SelaVPR has achieved competitive results, we argue that its adaptation is inefficient in training time and GPU memory usage, and that its re-ranking paradigm is costly in retrieval latency and storage usage. In pursuit of higher efficiency and better performance, we propose an extension of SelaVPR, called SelaVPR++. Concretely, we first design a parameter-, time-, and memory-efficient adaptation method that uses lightweight multi-scale convolution (MultiConv) adapters to refine intermediate features from the frozen foundation backbone. This adaptation method does not back-propagate gradients through the backbone during training, and the MultiConv adapter facilitates feature interactions along the spatial axes and introduces proper local priors, thus achieving higher efficiency and better performance. Moreover, we propose an innovative re-ranking paradigm for more efficient VPR. Instead of relying on local features for re-ranking, which incurs huge overhead in latency and storage, we employ compact binary features for initial retrieval and robust floating-point (global) features for re-ranking. To obtain such binary features, we propose a similarity-constrained deep hashing method, which can be easily integrated into the VPR pipeline. Finally, we improve our training strategy and unify the training protocol of several common training datasets to merge them for better training of VPR models. Extensive experiments show that ......
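
The two-stage retrieval described in the abstract (binary features for the initial search, floating-point global features for re-ranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the toy 4-dimensional vectors, and the choice to derive binary codes by sign-thresholding the float vectors are assumptions made for illustration only; in the paper the binary and floating-point features come from separate branches, and the binary codes are learned with a deep hashing loss.

```python
import math

def binarize(vec):
    """Pack the sign pattern of a vector into one int, 1 bit per dimension.
    Sign-thresholding is an illustrative stand-in for a learned hashing branch."""
    code = 0
    for v in vec:
        code = (code << 1) | (1 if v > 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two non-negative integer codes."""
    return bin(a ^ b).count("1")

def cosine(u, v):
    """Cosine similarity between two float vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def two_stage_search(q_bin, q_float, db_bin, db_float, k):
    # Stage 1: cheap shortlist of k candidates by Hamming distance on binary codes.
    shortlist = sorted(range(len(db_bin)), key=lambda i: hamming(q_bin, db_bin[i]))[:k]
    # Stage 2: re-rank only the shortlist with floating-point global features.
    return sorted(shortlist, key=lambda i: cosine(q_float, db_float[i]), reverse=True)

# Toy database of hypothetical global descriptors.
db_float = [
    [0.9, 0.1, -0.2, 0.4],
    [0.8, 0.2, -0.1, 0.5],
    [-0.7, 0.3, 0.6, -0.2],
    [0.1, -0.9, 0.3, 0.2],
]
db_bin = [binarize(v) for v in db_float]
query_float = [0.8, 0.25, -0.1, 0.5]
query_bin = binarize(query_float)

ranked = two_stage_search(query_bin, query_float, db_bin, db_float, k=2)
print(ranked)  # [1, 0]
```

In this toy case, entries 0 and 1 tie at Hamming distance 0 in the first stage, and the floating-point re-ranking then places entry 1 first. The storage argument is simple arithmetic: a D-bit binary code occupies D/8 bytes, whereas a D-dimensional float32 vector occupies 4D bytes, so at equal dimensionality the binary code is 32 times smaller.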

Paper Structure

This paper contains 18 sections, 17 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: Heatmap visualizations of feature maps from the pre-trained foundation model, SelaVPR, and SelaVPR++. The pre-trained model pays attention to some regions that are useless for VPR, e.g., dynamic riders. SelaVPR and SelaVPR++ focus on discriminative regions (buildings and trees). Compared with SelaVPR, SelaVPR++ focuses on more landmarks (e.g., smokestack) and eliminates more dynamic interference (e.g., truck), i.e., performs better in detail.
  • Figure 2: Comparison between different transfer learning methods. (a) is the common full fine-tuning, in which all blocks are trainable. (b) is a popular PETL method, where only the inner adapters are trainable, but backpropagation still passes through the entire frozen backbone. (c) is the memory-efficient adaptation following the basic framework of previous work (LST), which reduces training memory usage by eliminating the need for backpropagation through the backbone.
  • Figure 3: Illustration of the difference between our memory-efficient MultiConv adaptation network, i.e., (c), and the global adaptation in SelaVPR, i.e., (b). (a) is a transformer block in ViT. Instead of inserting the adapter into the block as in (b), we train a parallel side adaptation network as in (c), which consists of a series of MultiConv adapters (abbreviated as MCA) that progressively refine the intermediate features from the transformer blocks of the frozen backbone.
  • Figure 4: Illustration of our efficient two-stage VPR pipeline. The frozen foundation model combined with the side adapter networks is applied to extract the feature map. We leverage a linear projection and the GeM pooling (aggregation) to aggregate the feature map as a global descriptor. The branch above produces a compact binary feature for fast candidate retrieval. The branch below outputs a high-dimensional floating-point feature to re-rank the top-k candidates.
  • Figure 5: Qualitative results. In these challenging examples, our SelaVPR++ successfully returns the right database images, while other methods produce incorrect results. In the first two examples, which present drastic viewpoint changes between the query and (correct) database images, other methods wrongly return images similar to the query but from other places. The third example is quite challenging, as the query image is taken at night, showcasing significant changes in both light and viewpoint, with only the right part of the image faintly recognizable (e.g., traffic light, coffee shop signboard, and special pattern on the building surface). The last query shows a natural scene and contains almost no landmarks. All of these examples require a powerful ability to capture discriminative place details and handle interference in order to obtain accurate results.
  • ...and 3 more figures
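
The Figure 4 caption mentions GeM pooling as the step that aggregates the feature map into a global descriptor. As a rough illustration, generalized-mean (GeM) pooling computes, per channel, the p-th root of the mean of the p-th powers of the activations. The sketch below is a plain-Python rendering of that formula, not the paper's code: the function name, the default p = 3, the clamp value eps, and the list-of-lists layout are assumptions (in practice p is often a learnable parameter).

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling over spatial positions.

    feature_map: list of N spatial positions, each a list of C channel activations.
    Returns a list of C pooled values. Activations are clamped to eps so that
    fractional powers stay well defined (an assumed, commonly used safeguard).
    """
    n = len(feature_map)
    channels = len(feature_map[0])
    pooled = []
    for c in range(channels):
        mean_pow = sum(max(pos[c], eps) ** p for pos in feature_map) / n
        pooled.append(mean_pow ** (1.0 / p))
    return pooled

# Two spatial positions, two channels (hypothetical values).
fmap = [[1.0, 2.0], [3.0, 4.0]]
print(gem_pool(fmap, p=1.0))  # p = 1 reduces to average pooling: [2.0, 3.0]
print(gem_pool(fmap))         # p = 3 lies between the average and the maximum
```

With p = 1 the result equals average pooling, and as p grows the result approaches max pooling, so p controls how strongly the descriptor emphasizes the most activated regions.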