SciceVPR: Stable Cross-Image Correlation Enhanced Model for Visual Place Recognition

Shanshan Wan, Yingmei Wei, Lai Kang, Tianrui Shen, Haixuan Wang, Yee-Hong Yang

TL;DR

This paper tackles Visual Place Recognition (VPR) under varied conditions by addressing the instability of cross-image correlation. It introduces SciceVPR, which fuses multi-layer DINOv2 features using a channel-wise 1×1 convolution together with token-wise mixer layers, then distills cross-image invariant knowledge into a lightweight self-enhanced encoder to produce stable global descriptors $X_S$. Training combines a multi-similarity loss $L_{MS}$ with a distillation loss $L_D$, giving the objective $L_T = \gamma L_{MS} + \eta L_D$. Empirically, SciceVPR-B surpasses state-of-the-art one-stage methods on several datasets, while SciceVPR-L matches or exceeds two-stage models on challenging benchmarks such as MSLS and Tokyo24/7. The approach generalizes robustly across domain shifts and aims to enable efficient single-input VPR without re-ranking, with code to be released for reproducibility.
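The weighted objective $L_T = \gamma L_{MS} + \eta L_D$ can be illustrated with a minimal NumPy sketch. Several things here are assumptions rather than details from the paper: the multi-similarity loss is written in its standard form without pair mining, the distillation loss is taken to be a mean squared error between student and teacher descriptors, and the hyperparameter defaults (`alpha`, `beta`, `lam`, `gamma`, `eta`) are placeholders.

```python
import numpy as np

def multi_similarity_loss(desc, labels, alpha=2.0, beta=50.0, lam=0.5):
    """Standard multi-similarity loss, simplified (no pair mining).

    desc:   (B, D) global descriptors
    labels: (B,)   place IDs; images of the same place share an ID
    """
    desc = desc / np.linalg.norm(desc, axis=1, keepdims=True)
    sim = desc @ desc.T                              # cosine similarities
    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)  # same place, not self
    neg_mask = ~same                                     # different place
    pos = np.log1p(np.sum(np.exp(-alpha * (sim - lam)) * pos_mask, axis=1)) / alpha
    neg = np.log1p(np.sum(np.exp(beta * (sim - lam)) * neg_mask, axis=1)) / beta
    return float(np.mean(pos + neg))

def distillation_loss(student, teacher):
    # ASSUMPTION: mean squared error; the summary does not state the form of L_D.
    return float(np.mean((student - teacher) ** 2))

def total_loss(student, teacher, labels, gamma=1.0, eta=1.0):
    """L_T = gamma * L_MS + eta * L_D (weights here are placeholder values)."""
    return (gamma * multi_similarity_loss(student, labels)
            + eta * distillation_loss(student, teacher))
```

In this sketch the teacher descriptors would come from the cross-image model and the student descriptors from the self-enhanced encoder; when the two coincide, the distillation term vanishes and only the metric-learning term remains.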

Abstract

Visual Place Recognition (VPR) is a major challenge for robotics and autonomous systems, with the goal of predicting the location of an image based solely on its visual features. State-of-the-art (SOTA) models extract global descriptors using the powerful foundation model DINOv2 as backbone. These models either explore the cross-image correlation or propose a time-consuming two-stage re-ranking strategy to achieve better performance. However, existing works only utilize the final output of DINOv2, and the current cross-image correlation causes unstable retrieval results. To produce global descriptors that are both discriminative and stable, this paper proposes a stable cross-image correlation enhanced model for VPR, called SciceVPR. This model explores the full potential of DINOv2 in providing useful feature representations that implicitly encode valuable contextual knowledge. First, SciceVPR uses a multi-layer feature fusion module to capture increasingly detailed task-relevant channel and spatial information from the multi-layer output of DINOv2. Second, SciceVPR treats the invariant correlation between images within a batch as valuable knowledge to be distilled into the proposed self-enhanced encoder. In this way, SciceVPR can acquire fairly robust global features regardless of domain shifts (e.g., changes in illumination, weather and viewpoint between pictures taken in the same place). Experimental results demonstrate that the base variant, SciceVPR-B, outperforms SOTA one-stage methods with single input on multiple datasets with varying domain conditions. The large variant, SciceVPR-L, performs on par with SOTA two-stage models, scoring over 3% higher in Recall@1 than existing models on the challenging Tokyo24/7 dataset. Our code will be released at https://github.com/shuimushan/SciceVPR.
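The instability attributed to cross-image correlation can be seen in a toy example. The two encoders below are hypothetical stand-ins, not the paper's architectures: `cross_image_encode` lets every image attend to all images in its batch, while `self_enhanced_encode` processes each image independently. The point is only that the first makes a query's descriptor depend on its batch companions and the second does not.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_image_encode(batch):
    """Toy cross-image correlation: each image attends to the whole batch.

    batch: (B, D) per-image features. Row i of the output depends on every row.
    """
    scores = batch @ batch.T / np.sqrt(batch.shape[1])
    return batch + softmax(scores, axis=1) @ batch

def self_enhanced_encode(batch, weight):
    """Toy per-image encoder: row i of the output depends only on row i."""
    return batch + np.tanh(batch @ weight)

rng = np.random.default_rng(0)
query = rng.normal(size=(1, 4))
other = rng.normal(size=(1, 4))
pair = np.vstack([query, other])          # same query, now batched with another image
weight = rng.normal(size=(4, 4))

# Descriptor of the same query, computed alone vs. inside a batch of two.
cross_changes = not np.allclose(cross_image_encode(query)[0],
                                cross_image_encode(pair)[0])
self_changes = not np.allclose(self_enhanced_encode(query, weight)[0],
                               self_enhanced_encode(pair, weight)[0])
```

Here `cross_changes` is true and `self_changes` is false, which mirrors the motivation for distilling cross-image knowledge into an encoder that only ever sees one image at a time.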

Paper Structure

This paper contains 15 sections, 6 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Different retrieval results for the same query images when our SciceVPR model and the state-of-the-art CricaVPR model are used to describe an image. The database images are sequentially arranged and passed through the VPR models with a batch size of 2, while we test different query situations where the number of query images is either 1 or 2. We show the most similar database image for each query. Pictures inside an orange frame are in the same batch. Red frames and green frames represent incorrect and correct retrieval results, respectively. Results show that CricaVPR produces unstable global descriptors that are affected by the number of input images, whereas our SciceVPR generates both stable and discriminative global features.
  • Figure 2: Comparison of the CricaVPR and SciceVPR-B models on Tokyo24/7, with a descriptor dimensionality of 10752. The database descriptors of CricaVPR are stored with a sequentially arranged input batch size of 16, and its Recall@1 results vary with the number or order of query inputs. These results are consistently surpassed by our SciceVPR-B model.
  • Figure 3: The structure of Super-CricaVPR and SciceVPR. After training Super-CricaVPR with our proposed multi-layer feature fusion module, we use the output of Super-CricaVPR as supervision for SciceVPR. Only the parameters of conv(1,1) are passed to SciceVPR and are frozen during its training. Features are sequentially organized to pass through the cross-image encoder in Super-CricaVPR, whereas they are only augmented independently in the self-enhanced encoder of SciceVPR. We present the case where $B = 3$ and ${C_2} = 1$.
  • Figure 4: The difference between (a) CricaVPR and (b) Super-CricaVPR in producing regional features. (a) CricaVPR only makes use of the features from the last layer of the adapted DINOv2. The output class token serves to represent the whole image, and multi-level GeM pooling is performed on the 13 regions of the rearranged patch tokens. The 14 regional features are then concatenated and passed to the cross-image encoder. (b) Super-CricaVPR makes full use of the multi-layer features from the frozen DINOv2. The concatenated multi-layer patch tokens are fused in the channel and spatial dimensions. Similarly, multi-level GeM pooling is performed on the 14 regions into which the rearranged patch tokens are divided, and the resulting features are concatenated and passed to the cross-image encoder.
  • Figure 5: The detailed structure of (a) the multi-layer feature fusion module. Features from the last $M$ layers of frozen DINOv2 are concatenated and fused using a channel-wise $1 \times 1$ convolution together with (b) token-wise mixer layers across all spatial token locations.
  • ...and 5 more figures
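
The fusion and pooling steps described in the captions of Figures 4 and 5 can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the 1×1 convolution is written as a per-token linear map over channels, the token mixer is reduced to a single residual linear mixing across tokens, and the 14 regions are assumed to come from 1×1, 2×2 and 3×3 grids (1 + 4 + 9). All function names, shapes and weights are hypothetical.

```python
import numpy as np

def fuse_layers(layer_tokens, conv_w, mixer_w):
    """Fuse patch tokens taken from the last M backbone layers.

    layer_tokens: list of M arrays, each (N, C), one per layer
    conv_w:       (M*C, C_out) weights of the 1x1 conv (a per-token linear map)
    mixer_w:      (N, N) token-mixing weights (ASSUMPTION: one linear layer)
    """
    x = np.concatenate(layer_tokens, axis=1)   # (N, M*C) channel concatenation
    x = x @ conv_w                             # (N, C_out) channel fusion
    return x + mixer_w @ x                     # residual mixing across tokens

def gem(tokens, p=3.0, eps=1e-6):
    """Generalized-mean pooling over a set of tokens: (n, C) -> (C,)."""
    return np.mean(np.clip(tokens, eps, None) ** p, axis=0) ** (1.0 / p)

def multi_level_gem(tokens, h, w, levels=(1, 2, 3)):
    """Pool an (h*w, C) token map over level x level grids of regions.

    ASSUMPTION: levels (1, 2, 3) give 1 + 4 + 9 = 14 regional features.
    """
    grid = tokens.reshape(h, w, -1)
    feats = []
    for level in levels:
        row_groups = np.array_split(np.arange(h), level)
        col_groups = np.array_split(np.arange(w), level)
        for rows in row_groups:
            for cols in col_groups:
                region = grid[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
                feats.append(gem(region.reshape(-1, grid.shape[-1])))
    return np.stack(feats)                     # (num_regions, C)

# Usage with random stand-in features: M = 2 layers, a 6x6 token map, C = 4.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(36, 4)) for _ in range(2)]
fused = fuse_layers(layers, rng.normal(size=(8, 5)), 0.01 * rng.normal(size=(36, 36)))
regional = multi_level_gem(fused, h=6, w=6)    # one row per region
```

The regional features would then be concatenated and passed to the encoder that follows; the sketch stops before that stage because the summary does not describe it in enough detail.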