
VDNA-PR: Using General Dataset Representations for Robust Sequential Visual Place Recognition

Benjamin Ramtoula, Daniele De Martini, Matthew Gadd, Paul Newman

TL;DR

A general dataset representation technique is adapted to produce robust Visual Place Recognition (VPR) descriptors, which are crucial for real-world mobile robot localisation, by learning a very lightweight and simple encoder that generates task-specific descriptors.

Abstract

This paper adapts a general dataset representation technique to produce robust Visual Place Recognition (VPR) descriptors, which are crucial for real-world mobile robot localisation. Two parallel lines of work on VPR have shown, on one side, that general-purpose off-the-shelf feature representations can provide robustness to domain shifts and, on the other, that fusing information from sequences of images improves performance. In our recent work on measuring domain gaps between image datasets, we proposed the Visual Distribution of Neuron Activations (VDNA) to represent datasets of images. This representation naturally handles image sequences and provides a general and granular feature representation derived from a general-purpose model. Moreover, it is based on tracking neuron activation values over the set of images being represented and is not limited to a particular neural network layer, so it has access to both high- and low-level concepts. This work shows how VDNAs can be used for VPR by learning a very lightweight and simple encoder to generate task-specific descriptors. Our experiments show that our representation can be more robust than current solutions to serious domain shifts away from the training data distribution, such as shifts to indoor environments and aerial imagery.
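
The idea of tracking neuron activation values over a set of images can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `neuron_histograms`, the fixed activation range, the clipping of out-of-range values, and the random stand-in activations are all assumptions made for illustration; only the 500-bin histogram size comes from the Figure 2 caption.

```python
import numpy as np

def neuron_histograms(activations, num_bins=500, value_range=(-1.0, 1.0)):
    """Build one normalised histogram per neuron (hypothetical helper).

    activations: array of shape (num_images, num_neurons, num_positions),
    e.g. activations gathered from a frozen feature extractor.
    The fixed value_range and the clipping are assumptions of this sketch.
    """
    num_images, num_neurons, num_positions = activations.shape
    lo, hi = value_range
    # Pool every activation of a neuron across all images and positions.
    per_neuron = activations.transpose(1, 0, 2).reshape(num_neurons, -1)
    clipped = np.clip(per_neuron, lo, hi)
    # Map each value to a bin index in [0, num_bins - 1].
    idx = ((clipped - lo) / (hi - lo) * num_bins).astype(np.int64)
    idx = np.minimum(idx, num_bins - 1)
    hist = np.zeros((num_neurons, num_bins))
    rows = np.repeat(np.arange(num_neurons), per_neuron.shape[1])
    np.add.at(hist, (rows, idx.ravel()), 1.0)
    # Normalise so each neuron's histogram sums to one.
    return hist / hist.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
single_image = rng.standard_normal((1, 8, 16))   # stand-in activations
sequence = rng.standard_normal((5, 8, 16))       # a 5-image sequence
print(neuron_histograms(single_image).shape)     # same shape for one image...
print(neuron_histograms(sequence).shape)         # ...and for a whole sequence
```

Because the histograms pool activations over however many images are supplied, a single image and a sequence yield representations of the same size, which is what lets the representation handle image sequences directly.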

Paper Structure

This paper contains 23 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: VDNA-PR overview. As in other sequence-based works, we solve VPR by building and matching representations for image sequences along driven trajectories. We rely on VDNA representations [ramtoula2023vdna], which were originally introduced to measure domain gaps between datasets. They consist of histograms that describe the activations observed when passing images through a frozen self-supervised feature extractor F. Importantly, VDNAs keep track of activations for neurons throughout all layers of the network, keeping a general and granular multi-level representation. To generate more practical descriptors specifically for VPR, we propose an encoder E to encode VDNAs into descriptors that can efficiently be compared with traditional techniques.
  • Figure 2: Overview of VDNA-PR training. As in Figure 1, as a sequence of images passes through a pre-trained frozen feature extractor F, histograms tracking neuron-wise activations constitute a VDNA. The histogram corresponding to each neuron has $500$ bins, and a small 1D CNN encoder E maps each histogram to a lower-dimensional vector of length $4$ (with shared weights across the $9216$ neurons). The concatenation of these length-$4$ features has length $9216 \times 4 = 36864$ and is then itself passed through a linear layer W to reduce its dimension and form the final representation. It is on this representation that we perform contrastive learning with triplet losses, as is common in place recognition. At test time on different domains, we remove the linear layer W, which has learned features specific to the training domain, and use concatenations of encoded histogram features from selected neurons. With this training, we therefore learn neuron-wise descriptors that can be used and combined for VPR.
  • Figure 3: Example images from datasets used in this study. We evaluate generalisation capability by training approaches on MSLS and applying them to other urban and indoor environments, as well as to aerial imagery.
  • Figure 4: Recall@1 when evaluating performance using VDNA-PR descriptors from each layer of DINOv2 for all datasets considered.