CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition

Feng Lu; Xiangyuan Lan; Lijun Zhang; Dongmei Jiang; Yaowei Wang; Chun Yuan

TL;DR

CricaVPR addresses robust visual place recognition by introducing cross-image correlation-aware representation learning. It uses a cross-image encoder to propagate information across all images in a batch, enabling condition- and viewpoint-invariant global descriptors, and employs a MulConv adapter for parameter-efficient, multi-scale adaptation of a pre-trained backbone. The approach achieves state-of-the-art results across major VPR benchmarks (e.g., Pitts30k, MSLS, Tokyo24/7, Nordland, SVOX) with significantly reduced training time and parameter overhead. This work demonstrates the value of cross-image cues and lightweight foundation-model adaptation for robust VPR in diverse and challenging environments.

Abstract

Over the past decade, most methods in visual place recognition (VPR) have used neural networks to produce feature representations. These networks typically produce a global representation of a place image using only the image itself and neglect cross-image variations (e.g., viewpoint and illumination), which limits their robustness in challenging scenes. In this paper, we propose a robust global representation method with cross-image correlation awareness for VPR, named CricaVPR. Our method uses the attention mechanism to correlate multiple images within a batch. These images can be taken at the same place under different conditions or from different viewpoints, or even captured at different places. Our method can therefore use cross-image variations as a cue to guide representation learning, which yields more robust features. To further improve robustness, we propose a multi-scale convolution-enhanced adaptation method to adapt pre-trained visual foundation models to the VPR task, which introduces multi-scale local information to further enhance the cross-image correlation-aware representation. Experimental results show that our method outperforms state-of-the-art methods by a large margin with significantly less training time. The code is released at https://github.com/Lu-Feng/CricaVPR.
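The idea of correlating images within a batch through attention can be illustrated with a minimal sketch. This is not the released CricaVPR code: it assumes a single attention head, identity query/key/value projections, and no residual connection, layer normalization, or MLP, and the function names `softmax` and `cross_image_attention` are hypothetical. Each "token" is the feature vector of one image, so every output vector is a weighted mix of features from all images in the batch.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_image_attention(batch_feats):
    """Single-head self-attention where each token is one image's feature.

    batch_feats: list of B feature vectors (each a list of D floats).
    Returns (outputs, weights): B updated vectors and the B x B attention matrix.
    Assumption: identity Q/K/V projections, purely for illustration.
    """
    dim = len(batch_feats[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for q in batch_feats:
        # Scaled dot-product similarity between this image and every image.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in batch_feats]
        weights.append(softmax(scores))
    # Each output is an attention-weighted sum of all image features.
    outputs = [
        [sum(w * v[d] for w, v in zip(row, batch_feats)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights

# Toy batch: images 0 and 1 resemble each other, image 2 is different.
feats = [[4.0, 0.0], [3.0, 1.0], [0.0, 4.0]]
outputs, weights = cross_image_attention(feats)
```

In this toy batch, image 0 attends more strongly to the similar image 1 than to the dissimilar image 2, which is the sense in which a feature can draw on related images elsewhere in the batch.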

Paper Structure

This paper contains 26 sections, 5 equations, 12 figures, and 16 tables.

Introduction
Related Work
Methodology
Preliminary
Cross-image Correlation-aware Place Representation
Multi-scale Convolution-enhanced Adaptation
Training Strategy
Experiments
Datasets and Performance Evaluation
Implementation Details
Comparison with State-of-the-Art Methods
Ablation Study
Conclusions
Overview
Visualizations of Place Features using t-SNE
...and 11 more sections

Figures (12)

Figure 1: Comparison of Recall@1 and descriptor dimensionality for different methods on Pitts30k. GCL, NetVLAD, SFRS, and CricaVPR (Ours) all use PCA for dimensionality reduction. Our method achieves significantly higher Recall@1 than the other methods with compact 512-dim global features.
Figure 2: An example of several images in a batch. (a), (b), and (c) are taken from the same place with different conditions (seasons) and viewpoints. (d), (e), and (c) are captured from different places, but (d) is similar to (c). When the model produces the features of (c), it can harvest relevant information from other images to yield a better representation.
Figure 3: The pipeline to produce the proposed cross-image correlation-aware representation. The cross-image encoder is the core component for modeling correlations between different image features in a batch. Note that we correlate the $i$-th regional features of all images in a batch, not all regional features of one image. The cross-image encoder consists of 2 stacked vanilla transformer encoder layers, with the LN layer placed after the MHA/MLP layer; this differs from ViT, where LN comes before MHA/MLP.
Figure 4: Illustration of our multi-scale convolution-enhanced adaptation. (a) is a transformer block in ViT. (b) is the MulConv adapter. We add the MulConv adapter in parallel to the MLP layer in each transformer block to achieve our adaptation as in (c).
Figure 5: Qualitative results. These four challenging examples show severe viewpoint and condition changes. The proposed CricaVPR retrieves the correct results, while other methods return incorrect images. In each example, some methods return similar-looking images from different places (i.e., incorrect) due to perceptual aliasing. In the second example, the query image is taken at night, causing all the other methods to return night images from different places (i.e., wrong). In contrast, our method returns an image taken during the day at the same place (i.e., correct).
...and 7 more figures
