Register assisted aggregation for Visual Place Recognition
Xuan Yu, Zhenyong Fu
TL;DR
This paper tackles Visual Place Recognition under appearance and viewpoint variations. It introduces RegVPR, a register-assisted aggregation method that inserts registers during local descriptor aggregation, uses a Transformer Encoder to reweight features, and then discards the register content, preserving robust static cues. The approach combines a DINOv2 backbone fine-tuned with a global adapter, a multi-scale feature fusion module, and a Sinkhorn-based optimal transport step for clustering. With a single-stage pipeline, it achieves state-of-the-art results on MSLS, Pitts, NordLand, and SPED and shows significant robustness to extreme conditions. The work highlights the importance of keeping discriminative structures such as buildings while filtering out dynamic background content, and its single-stage design retrieves faster than two-stage methods.
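To make the Sinkhorn-based optimal transport step more concrete, the following is a minimal sketch of Sinkhorn normalization applied to a token-to-cluster score matrix. It is not the paper's formulation: the function name `sinkhorn`, the uniform marginals, the temperature `eps`, and the iteration count are all assumptions chosen for illustration.

```python
import math


def sinkhorn(scores, n_iters=50, eps=1.0):
    """Turn a (tokens x clusters) score matrix into a soft assignment plan.

    Assumption: uniform marginals, so each token carries mass 1/n and each
    cluster receives mass 1/m. Higher scores mean stronger affinity.
    """
    n, m = len(scores), len(scores[0])
    # Gibbs kernel; subtracting the max score keeps exp() numerically stable.
    top = max(max(row) for row in scores)
    kernel = [[math.exp((s - top) / eps) for s in row] for row in scores]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Row scaling: make every row of diag(u) K diag(v) sum to 1/n.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        # Column scaling: make every column sum to 1/m.
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical affinities between 4 local tokens and 2 clusters.
example_scores = [[2.0, 0.1], [0.2, 1.5], [1.0, 1.0], [0.3, 0.4]]
example_plan = sinkhorn(example_scores)
```

Each entry of `example_plan` can be read as the share of a token's mass assigned to a cluster. The alternating row and column scaling is what prevents any single cluster from absorbing all tokens.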
Abstract
Visual Place Recognition (VPR) uses computer vision to determine the location at which a query image was taken. Season, lighting, and the time elapsed between query and database images cause significant changes in appearance, which make place recognition harder. Previous methods often discard uninformative features (such as sky, road, and vehicles) but, in doing so, also indiscriminately discard features that improve recognition accuracy (such as buildings and trees). To preserve these useful features, we propose a new feature aggregation method. Specifically, to obtain global and local features that contain discriminative place information, we add a set of registers alongside the original image tokens to assist model training. After the attention weights have been reallocated, the registers are discarded. Experimental results show that these registers, surprisingly, separate unstable features from the original image representation, and that the resulting method outperforms state-of-the-art methods.
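The append-attend-discard pattern described above can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: it uses a single self-attention pass with identity query, key, and value projections instead of a trained Transformer Encoder, and the register values are set by hand rather than learned. The names `softmax` and `attend_with_registers` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_registers(patch_tokens, registers):
    """Self-attend over patches plus registers, then drop the registers.

    Assumption: identity Q/K/V projections, one head, no MLP or layer norm.
    """
    sequence = patch_tokens + registers  # registers join the token sequence
    dim = len(sequence[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in sequence:
        # Attention weights are shared across patches AND registers, so the
        # registers can absorb attention that would otherwise land on patches.
        logits = [scale * sum(q * k for q, k in zip(query, key))
                  for key in sequence]
        weights = softmax(logits)
        outputs.append([sum(w * token[d] for w, token in zip(weights, sequence))
                        for d in range(dim)])
    # Discard the register outputs; only patch tokens go on to aggregation.
    return outputs[:len(patch_tokens)]


# Three hypothetical 2-D patch tokens and one register.
example_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
example_registers = [[0.5, 0.5]]
example_out = attend_with_registers(example_patches, example_registers)
```

The returned list has one vector per patch token, so downstream aggregation sees the same sequence length as it would without registers. Only the attention weights, and therefore the patch representations, have changed.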
