The Overlooked Value of Test-time Reference Sets in Visual Place Recognition

Mubariz Zaffar, Liangliang Nan, Sebastian Scherer, Julian F. P. Kooij

TL;DR

This work addresses the train-test domain gap in Visual Place Recognition (VPR) by exploiting the test-time reference map, which contains target-domain images and poses. It introduces Reference-Set-Finetuning (RSF), a simple self-supervised strategy that finetunes a VPR model on a finetuning dataset D_ft, constructed from the map using augmentations and pose-aware triplet mining, and optimizes the embedding space with the triplet loss L_triplet. RSF requires no new data and no backbone changes. It improves Recall@1 by ~2.3% on average on challenging datasets, preserves generalization across other test sets, and benefits multiple SOTA backbones/aggregators (e.g., BoQ, SALAD). The results indicate that test-time maps are a practical and effective domain adaptation resource for VPR, with room for further gains through augmentation strategies and formulation variants.
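The two ingredients named above, pose-aware triplet mining and a triplet loss, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `mine_triplets` and `triplet_loss`, the radii `pos_radius` and `neg_radius`, and the `margin` value are all hypothetical, and the sketch assumes each anchor is the descriptor of an augmented copy of a reference image while positives and negatives are descriptors of unaugmented reference images.

```python
import math


def mine_triplets(positions, pos_radius=10.0, neg_radius=25.0):
    """Build (anchor, positive, negative) index triplets from map poses.

    Anchor index a stands for an augmented copy of reference image a.
    Positives are references within pos_radius of a (including a itself);
    negatives are references farther than neg_radius. Both radii are
    hypothetical values chosen for illustration.
    """
    triplets = []
    for a, pose_a in enumerate(positions):
        dists = [math.dist(pose_a, pose_b) for pose_b in positions]
        positives = [i for i, d in enumerate(dists) if d <= pos_radius]
        negatives = [i for i, d in enumerate(dists) if d > neg_radius]
        triplets.extend((a, p, n) for p in positives for n in negatives)
    return triplets


def triplet_loss(anchor_desc, ref_desc, triplets, margin=0.1):
    """Mean hinge loss max(d(a, p) - d(a, n) + margin, 0) over all triplets."""
    if not triplets:
        return 0.0
    total = 0.0
    for a, p, n in triplets:
        d_pos = math.dist(anchor_desc[a], ref_desc[p])
        d_neg = math.dist(anchor_desc[a], ref_desc[n])
        total += max(d_pos - d_neg + margin, 0.0)
    return total / len(triplets)


# Toy map: two nearby places and one distant place (2D positions).
positions = [(0.0, 0.0), (5.0, 0.0), (100.0, 0.0)]
ref_desc = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
triplets = mine_triplets(positions)

# Anchors that match their own references incur no loss.
loss_good = triplet_loss(ref_desc, ref_desc, triplets)
# Anchors that collapse to one descriptor are penalized.
loss_bad = triplet_loss([(1.0, 0.0)] * 3, ref_desc, triplets)
```

In an actual finetuning loop the descriptors would come from the VPR model and the loss would be backpropagated. Plain tuples are used here only to keep the sketch free of dependencies.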

Abstract

Given a query image, Visual Place Recognition (VPR) is the task of retrieving an image of the same place from a reference database with robustness to viewpoint and appearance changes. Recent works show that some VPR benchmarks are solved by methods using Vision-Foundation-Model backbones and trained on large-scale and diverse VPR-specific datasets. Several benchmarks remain challenging, particularly when the test environments differ significantly from the usual VPR training datasets. We propose a complementary, unexplored source of information to bridge the train-test domain gap, which can further improve the performance of State-of-the-Art (SOTA) VPR methods on such challenging benchmarks. Concretely, we identify that the test-time reference set, the "map", contains images and poses of the target domain, and must be available before the test-time query is received in several VPR applications. Therefore, we propose to perform simple Reference-Set-Finetuning (RSF) of VPR models on the map, boosting the SOTA (~2.3% increase on average for Recall@1) on these challenging datasets. Finetuned models retain generalization, and RSF works across diverse test datasets.
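The Recall@1 figure quoted above is a retrieval metric: a query counts as correct if its nearest reference descriptor belongs to the same place. A minimal sketch of Recall@K follows, assuming Euclidean distance between descriptors and a per-query set of correct reference indices. The function name `recall_at_k` and the toy data are illustrative and not taken from the paper.

```python
import math


def recall_at_k(query_desc, ref_desc, ground_truth, k=1):
    """Fraction of queries with at least one correct reference in the top k.

    ground_truth[i] is the set of reference indices counted as correct
    matches for query i.
    """
    hits = 0
    for q, correct in zip(query_desc, ground_truth):
        # Rank all reference indices by descriptor distance to the query.
        ranked = sorted(range(len(ref_desc)),
                        key=lambda r: math.dist(q, ref_desc[r]))
        if set(ranked[:k]) & set(correct):
            hits += 1
    return hits / len(query_desc)


# Toy example: three references and three queries.
refs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
queries = [(0.1, 0.0), (0.9, 0.1), (0.3, 0.6)]
truth = [{0}, {1}, {0}]

r1 = recall_at_k(queries, refs, truth, k=1)
r2 = recall_at_k(queries, refs, truth, k=2)
```

In this toy example the third query lies closest to the wrong reference, so it is a miss at k=1 but is recovered at k=2.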

Paper Structure

This paper contains 12 sections, 2 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Large-scale VPR training datasets are usually created from Google Street View ali2022gsv, e.g., the GSV-cities dataset. Thus, models trained in these environments perform well (SOTA Recall@5 $\sim 98\text{--}99\%$) for similar test datasets, e.g., the Tokyo-247 dataset arandjelovic2016netvlad, but suffer in unseen environments, e.g., the railway-tracks of the Nordland dataset nordlanddataset. A train-test domain gap exists, as evident in the T-SNE projection of descriptors computed using BoQ-DinoV2 ali2024boq for randomly sampled images of these datasets. Descriptors from the Tokyo-247 dataset form a single cluster with the GSV-cities dataset, while the Nordland dataset is further away. Creating a finetuning dataset by using the freely available test-time reference images could help bridge the train-test domain gap.
  • Figure 2: Deep learning for VPR usually utilizes a pretrained neural network that is further trained on a VPR dataset in a supervised manner with ground-truth poses. This usual pipeline assumes that we do not have any access to the test environment and that the training dataset is diverse enough to cover features of the test domain. However, there is always a train-test domain gap. We propose that the reference images in the test set are freely available offline in VPR and could be used to finetune VPR methods using simple data augmentations. This novel take on the problem setting of VPR results in reference-set-finetuned (RSF) models that are more robust than the original trained model.
  • Figure 3: Examples of the augmentations applied to create finetuning queries using Kornia augmentations riba2020kornia. Left-most is the original reference image.
  • Figure 4: Examples of queries that are mismatched by the original BoQ-DinoV2 model but correctly matched by our reference-set-finetuned BoQ-RSF model. The last row is the exception: it shows two BoQ-RSF failure cases.
  • Figure 5: Learned attention of the original BoQ model and the BoQ-RSF model on a ground-truth reference image. The RSF model attends more to the building facades, while the original BoQ attends to edges. Both attention masks come from the same BoQ query in the original and the BoQ-RSF model.