Register assisted aggregation for Visual Place Recognition

Xuan Yu, Zhenyong Fu

TL;DR

This paper tackles Visual Place Recognition under appearance and viewpoint variations. It introduces RegVPR, a register assisted aggregation method that inserts registers during local descriptor aggregation and uses a Transformer Encoder to reweight features and discard register content, preserving robust static cues. The approach combines a global adapter fine-tuned DINOv2 backbone with a multi-scale feature fusion module and a Sinkhorn-based optimal transport step for clustering. It achieves state-of-the-art results on MSLS, Pitts, NordLand, and SPED with a single-stage pipeline and shows significant robustness to extreme conditions. The work highlights the importance of keeping discriminative structures like buildings while filtering out dynamic background, improving retrieval speed over two-stage methods.

Abstract

Visual Place Recognition (VPR) is the task of using computer vision to determine the location at which a query image was taken. Season, lighting, and the time elapsed between query and database images cause significant appearance changes, which make place recognition harder. Previous methods often discard useless features (such as sky, road, and vehicles) but also indiscriminately discard features that help improve recognition accuracy (such as buildings and trees). To preserve these useful features, we propose a new feature aggregation method. Specifically, to obtain global and local features that contain discriminative place information, we add registers on top of the original image tokens to assist model training. After the attention weights are reallocated, these registers are discarded. Experimental results show that the registers, surprisingly, separate unstable features from the original image representation, and that our method outperforms state-of-the-art methods.
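
The register idea in the abstract (append extra tokens, let attention redistribute weight, then drop the extra tokens) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the random register initialisation, and the single-head attention with identity projections are all hypothetical simplifications of what would be a learned Transformer Encoder.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (illustrative)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def aggregate_with_registers(descriptors, num_registers, rng):
    """Append registers, mix all tokens with attention, then discard the registers."""
    dim = len(descriptors[0])
    # Assumption: registers start as small random vectors carrying no image content.
    registers = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_registers)]
    mixed = self_attention(descriptors + registers)
    # Keep only the slots that correspond to the original descriptors.
    return mixed[: len(descriptors)]


descriptors = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1], [0.0, 0.3, 0.8]]
kept = aggregate_with_registers(descriptors, num_registers=2, rng=random.Random(0))
print(len(kept), "descriptors kept; register slots were dropped")
```

In a trained model the registers would be learned parameters, so that attention can route unstable content into them before they are dropped; the random vectors here only show the data flow.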

Paper Structure (13 sections, 7 equations, 6 figures, 2 tables)

This paper contains 13 sections, 7 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison of heatmaps between the SALAD model and our method. The SALAD model visibly discards some building features (inside the green box, which we want to preserve) while retaining some vehicle features (inside the red box, which we want to discard).
  • Figure 2: Illustration of the multi-scale feature fusion module. (a) is a standard Transformer block, and (b) is the structure of the multi-scale feature fusion module. We place the multi-scale feature fusion module in parallel with the MLP layer in each standard Transformer block to obtain the global adapter (c).
  • Figure 3: Illustration of our VPR pipeline. First, a ViT backbone with a multi-scale feature fusion module extracts local features and global labels; score projection then produces the feature-to-cluster score matrix. Score projection is essentially a small MLP, and the optimal transport module uses the Sinkhorn algorithm. Next, we explicitly add registers to the sequence; together with the local features, they are turned into local descriptors through the score matrix. At this point, the registers do not contain any information from the image. After dimensionality reduction, the local descriptors and registers are fed into a Transformer Encoder that reallocates feature weights, assigning useless features to the registers, which are then discarded. Finally, the remaining local descriptors are aggregated into the final descriptor and concatenated with the global token.
  • Figure 4: Attention map visualizations of the SALAD model and our model. We compute the mean over the channel dimension of the output feature map and display it as a heatmap. The SALAD feature map may retain features that are not helpful for VPR, such as cars, and discard features that are helpful for retrieval, such as buildings and overpasses. In sky and road regions, our feature maps are smoother than those of SALAD.
  • Figure 5: Qualitative results. In these four challenging examples (including light changes, viewpoint changes, dynamic objects, and weather changes), our method successfully retrieved the correct database images, while all other methods produced incorrect results.
  • ...and 1 more figures
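
The Figure 3 caption states that the optimal transport module uses the Sinkhorn algorithm on a feature-to-cluster score matrix. A minimal sketch of plain entropic Sinkhorn normalisation follows. The uniform row and column marginals, the `eps` temperature, the iteration count, and the toy score values are assumptions chosen for illustration; the paper's exact marginals and score projection are not given in this section.

```python
import math


def sinkhorn(scores, eps=1.0, iters=100):
    """Turn an n x m score matrix into a transport plan with uniform marginals."""
    n, m = len(scores), len(scores[0])
    row_target, col_target = 1.0 / n, 1.0 / m  # assumed uniform marginals
    kernel = [[math.exp(s / eps) for s in row] for row in scores]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Rescale rows, then columns, so each matches its target mass.
        u = [row_target / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_target / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy scores: 4 local features (rows) against 3 clusters (columns).
scores = [
    [0.9, 0.1, -0.3],
    [0.2, 0.8, 0.0],
    [-0.5, 0.3, 0.7],
    [0.4, -0.2, 0.6],
]
plan = sinkhorn(scores)
for row in plan:
    print([round(p, 4) for p in row])
```

Each row of `plan` can be read as a soft assignment of one local feature to the clusters. For large score magnitudes or small `eps`, a log-domain variant would be the numerically safer choice.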