Pair-VPR: Place-Aware Pre-training and Contrastive Pair Classification for Visual Place Recognition with Vision Transformers

Stephen Hausler, Peyman Moghadam

TL;DR

Pair-VPR tackles Visual Place Recognition (VPR) with a two-stage training pipeline: it first pre-trains a Vision Transformer encoder and decoder using place-aware Siamese Masked Image Modeling (MIM), then jointly optimizes a global descriptor and a pair classifier used for re-ranking. With a ViT-B encoder, the approach reports state-of-the-art VPR performance across five benchmark datasets, and larger encoders improve localization recall further. Key contributions are the place-aware MIM pre-training, the joint second-stage objective that combines metric learning for the global descriptor with pairwise classification, and an inference scheme that retrieves candidates with global descriptors and then re-ranks them with the pair classifier, balancing speed and accuracy. The method is relevant to robust localization and loop-closure detection in robotics and mapping.

Abstract

In this work we propose a novel joint training method for Visual Place Recognition (VPR), which simultaneously learns a global descriptor and a pair classifier for re-ranking. The pair classifier predicts whether a given pair of images is from the same place. The network comprises only Vision Transformer components for both the encoder and the pair classifier, and both components are trained using their respective class tokens. Existing VPR methods typically initialize the network with pre-trained weights from a generic image dataset such as ImageNet. In this work we propose an alternative pre-training strategy that uses Siamese Masked Image Modelling as the pre-training task. We propose a place-aware image sampling procedure over a collection of large VPR datasets for pre-training our model, to learn visual features tuned specifically for VPR. By re-using the Masked Image Modelling encoder and decoder weights in the second stage of training, Pair-VPR achieves state-of-the-art VPR performance across five benchmark datasets with a ViT-B encoder, along with further improvements in localization recall with larger encoders. The Pair-VPR website is: https://csiro-robotics.github.io/Pair-VPR.
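
The way the two learned components fit together at inference, retrieval with global descriptors followed by re-ranking with the pair classifier, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity as the retrieval metric, the function names `retrieve_top_k` and `rerank`, and the `pair_score` callable standing in for the trained pair classifier are all hypothetical choices made for the example.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two descriptors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_desc: Vector, db_descs: Sequence[Vector], k: int) -> List[int]:
    """Step 1: rank database images by global-descriptor similarity, keep the top k.

    Assumption: cosine similarity is used here only for illustration.
    """
    scored = [(cosine_similarity(query_desc, d), i) for i, d in enumerate(db_descs)]
    # Highest similarity first; ties are broken by the lower database index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [index for _, index in scored[:k]]


def rerank(candidates: Sequence[int], pair_score: Callable[[int], float]) -> List[int]:
    """Step 2: reorder the shortlisted candidates by a pairwise same-place score.

    `pair_score` is a hypothetical stand-in for the pair classifier: it takes a
    database index and returns a score for the (query, database) pair, where a
    higher score means the pair is more likely to show the same place.
    """
    # sorted() is stable, so equal scores keep their retrieval order.
    return sorted(candidates, key=lambda index: -pair_score(index))


if __name__ == "__main__":
    # Toy 2-D descriptors; real global descriptors are much higher-dimensional.
    database = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
    query = [1.0, 0.05]
    shortlist = retrieve_top_k(query, database, k=3)
    toy_scores = {0: 0.2, 1: 0.9, 2: 0.1, 3: 0.4}  # made-up classifier outputs
    print(shortlist, rerank(shortlist, lambda index: toy_scores[index]))
```

Only the shortlist is passed through the pairwise scorer, so the cost of the second step grows with `k` rather than with the size of the database, which is the usual reason for splitting retrieval and re-ranking this way.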

Paper Structure

This paper contains 20 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the proposed Pair-VPR method. Left: In Stage 1 of training, we train a ViT encoder and decoder using Siamese masked image modelling with place-aware image sampling. Right: In Stage 2, we re-use the pre-trained encoder and decoder and train specifically for the VPR task, jointly learning a global descriptor and a pair classifier.
  • Figure 2: During inference, we pass in pairs of (query, database) images and the network produces a score estimating whether the two images are from the same location, along with a global descriptor per image.
  • Figure 3: Recall@1 as the number of re-ranked candidates is increased for Pair-VPR-s.
  • Figure 4: Qualitative results on the benchmark datasets, showing the performance of the computationally cheapest version of Pair-VPR (Pair-VPR-s). We provide three success examples and two failure cases per dataset.
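
Figure 3 reports Recall@1, so it may help to see how a Recall@N figure is typically computed in VPR evaluation: a query counts as a hit if at least one of its top-N ranked database images is a true match. The sketch below assumes this common definition; the function name `recall_at_n` and the representation of ground truth as one set of correct database indices per query are hypothetical choices for the example, not details taken from the paper.

```python
from typing import Sequence, Set


def recall_at_n(
    rankings: Sequence[Sequence[int]],
    ground_truth: Sequence[Set[int]],
    n: int,
) -> float:
    """Fraction of queries with at least one correct match in their top-n results.

    rankings[q] is the ranked list of database indices returned for query q,
    and ground_truth[q] is the (assumed) set of database indices that show
    the same place as query q.
    """
    if len(rankings) != len(ground_truth):
        raise ValueError("rankings and ground_truth must have one entry per query")
    if not rankings:
        return 0.0
    hits = sum(
        1
        for ranked, positives in zip(rankings, ground_truth)
        if any(candidate in positives for candidate in ranked[:n])
    )
    return hits / len(rankings)
```

Under this definition, re-ranking can only change Recall@1 for queries whose correct match is already somewhere in the shortlist, which is why the number of re-ranked candidates matters.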