Regressing Transformers for Data-efficient Visual Place Recognition

María Leyva-Vallina, Nicola Strisciuglio, Nicolai Petkov

TL;DR

The paper tackles visual place recognition (VPR) by addressing label noise and reliance on re-ranking through a regression formulation that uses graded field-of-view overlap as ground-truth similarity. It trains descriptors via a siamese architecture (notably Vision Transformers) with a mean-squared error loss so that descriptor distance directly reflects image similarity, eliminating the need for hard-pair mining or re-ranking. The approach achieves competitive or superior recall on MSLS, Pittsburgh30k, and Tokyo24/7 with strong data efficiency, requiring only a few thousand training pairs to converge. Attention analyses and lower KL divergence between distance distributions and ground-truth similarity corroborate that regression-focused descriptors capture stable, relevant visual cues for robust ranking and generalize well across datasets. The work thus offers a simpler, energy-efficient VPR pipeline with strong ranking capabilities and practical impact for scalable localization systems.

Abstract

Visual place recognition is a critical task in computer vision, especially for localization and navigation systems. Existing methods often rely on contrastive learning: image descriptors are trained to have small distance for similar images and larger distance for dissimilar ones in a latent space. However, this approach struggles to ensure accurate distance-based image similarity representation, particularly when training with binary pairwise labels, and complex re-ranking strategies are required. This work introduces a fresh perspective by framing place recognition as a regression problem, using camera field-of-view overlap as similarity ground truth for learning. By optimizing image descriptors to align directly with graded similarity labels, this approach enhances ranking capabilities without expensive re-ranking, offering data-efficient training and strong generalization across several benchmark datasets.
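
The regression idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, purely for illustration, that the predicted similarity of a pair is the cosine similarity of its two descriptors, and that each pair carries a graded label psi in [0, 1] (for example, field-of-view overlap). The function names `l2_normalize`, `predicted_similarity`, and `mse_similarity_loss` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a descriptor to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def predicted_similarity(desc_a, desc_b):
    """Cosine similarity of two descriptors.

    Assumption: the paper's exact similarity function may differ;
    cosine similarity is used here only as a stand-in.
    """
    a = l2_normalize(desc_a)
    b = l2_normalize(desc_b)
    return sum(x * y for x, y in zip(a, b))


def mse_similarity_loss(pairs):
    """Mean squared error between predicted and graded similarity.

    pairs: list of (desc_a, desc_b, psi), where psi in [0, 1] is the
    graded ground-truth similarity of the image pair.
    """
    errors = [(predicted_similarity(a, b) - psi) ** 2 for a, b, psi in pairs]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Toy descriptors standing in for the two branches of a siamese encoder.
    batch = [
        ([1.0, 0.0], [1.0, 0.0], 1.0),  # same view, full overlap
        ([1.0, 0.0], [1.0, 1.0], 0.5),  # partial overlap
        ([1.0, 0.0], [0.0, 1.0], 0.0),  # no overlap
    ]
    print(mse_similarity_loss(batch))
```

In contrast to a contrastive loss with binary labels, every pair here contributes an error proportional to how far its predicted similarity is from the graded label, so no margin or hard-pair mining step appears in the sketch.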

Paper Structure

This paper contains 12 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: A reference image (leftmost) and four match images, taken at different distances. The larger the distance with respect to the reference image, the lower the annotated similarity ground truth $\psi$, and the smaller the amount of shared visual features.
  • Figure 2: Retrieval performance every 10k training iterations on the (a) MSLS-val, (b) Pitts30k, and (c) Tokyo24/7 for the same ViT-R50 encoder trained with CL, GCL, and MSE loss functions. We ran each experiment three times and report the average, minimum and maximum R@5.
  • Figure 3: Example attention maps on the last layer of ViT-R50-MSE and ViT-R50-GCL models for pairs of similar images from the MSLS validation dataset (columns 1-2), and the Tokyo24/7 dataset (columns 3-4).
  • Figure 4: Results obtained on the MSLS validation, MSLS test, Pittsburgh 30k and Tokyo 24/7 datasets by MSE-trained models with and without PCA whitening. Reducing the dimensionality of the descriptors and applying the whitening transform contribute to an increase of the retrieval performance.
  • Figure 5: Example of the covariance matrices of features (from the MSLS validation set) learned with MSE and contrastive losses.