Unifying Local and Global Multimodal Features for Place Recognition in Aliased and Low-Texture Environments

Alberto García-Hernández, Riccardo Giubilato, Klaus H. Strobl, Javier Civera, Rudolph Triebel

TL;DR

UMF (Unifying Local and Global Multimodal Features) is a place recognition model that fuses vision and LiDAR features through cross-attention blocks and adds a re-ranking stage, which uses local feature matching to re-order the top-k candidates retrieved with a global representation.

Abstract

Perceptual aliasing and weak textures pose significant challenges to the task of place recognition, hindering the performance of Simultaneous Localization and Mapping (SLAM) systems. This paper presents a novel model, called UMF (standing for Unifying Local and Global Multimodal Features), that 1) leverages multi-modality through cross-attention blocks between vision and LiDAR features, and 2) includes a re-ranking stage that uses local feature matching to re-order the top-k candidates retrieved using a global representation. Our experiments, particularly on sequences captured in a planetary-analogous environment, show that UMF significantly outperforms previous baselines in those challenging aliased environments. Since our work aims to enhance the reliability of SLAM in all situations, we also explore its performance on the widely used RobotCar dataset, for broader applicability. Code and models are available at https://github.com/DLR-RM/UMF
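The two-stage idea in the abstract (shortlist by global descriptor, then re-order by local feature matching) can be illustrated with a minimal, standard-library sketch. Everything here is an assumption for illustration: the dictionary layout, the function names, the toy descriptors, and in particular the scoring rule, which counts mutual nearest-neighbour local features as a simple stand-in and is not claimed to be the matching procedure used by UMF.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_by_global(query_desc, db_descs, k):
    """Stage 1: indices of the k database places closest to the query descriptor."""
    order = sorted(range(len(db_descs)), key=lambda i: l2(query_desc, db_descs[i]))
    return order[:k]

def nearest(feature, candidates):
    """Index of the candidate feature closest to `feature`."""
    return min(range(len(candidates)), key=lambda j: l2(feature, candidates[j]))

def mutual_matches(feats_a, feats_b):
    """Count local features that are each other's nearest neighbour (hypothetical score)."""
    count = 0
    for i, fa in enumerate(feats_a):
        j = nearest(fa, feats_b)
        if nearest(feats_b[j], feats_a) == i:
            count += 1
    return count

def retrieve_and_rerank(query, database, k):
    """Stage 2: re-order the stage-1 shortlist by local match count, highest first."""
    shortlist = top_k_by_global(query["global"], [p["global"] for p in database], k)
    scores = {i: mutual_matches(query["local"], database[i]["local"]) for i in shortlist}
    # sorted() is stable, so ties keep their global-descriptor order.
    return sorted(shortlist, key=lambda i: -scores[i])

# Toy data: place 0 is an aliased look-alike (closest global descriptor),
# place 1 is the true match (consistent local features).
query = {"global": [0.0, 0.0], "local": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]}
database = [
    {"global": [0.1, 0.0], "local": [[5.0, 5.0], [6.0, 6.0]]},
    {"global": [0.2, 0.0], "local": [[1.0, 0.1], [0.1, 1.0], [1.0, 0.9]]},
    {"global": [9.0, 9.0], "local": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]},
]
shortlist = top_k_by_global(query["global"], [p["global"] for p in database], 2)
ranked = retrieve_and_rerank(query, database, k=2)
```

In this toy case the global descriptor alone ranks the look-alike first, and the local match count moves the true match to the top. Place 2 never enters the shortlist, which shows the limit of re-ranking: it can only re-order candidates that the global stage retrieved.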

Paper Structure

This paper contains 10 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: (a) The LRU rover traversing the Moon-analogue environment of Mt. Etna, Sicily, recording the DLR Planetary Stereo Solid-State LiDAR Inertial (S3LI) dataset. (b) Aligned visual and 3D LiDAR data. Note the challenging texture and geometry for place recognition.
  • Figure 2: UMF overview. Each branch encodes one of the inputs independently. The encodings of the individual modalities are fused by self- and cross-attention modules into a single global multimodal representation. For each data modality, a separate branch also extracts local features. During inference, we query a database of places with the global multimodal descriptor using a k-d tree: the top-$k$ candidates are retrieved via nearest-neighbor search and then re-ranked using local features from both modalities. This last stage is the main contribution of our paper.
  • Figure 3: Local Super-features extracted with the LIT module for both modalities. Attention maps show the areas on which each Super-feature focuses.
  • Figure 4: RANSAC local branch visualization. The resulting attention maps are used to select the salient features.
  • Figure 5: Illustration of our pre-training on RobotCar. First, masked inputs are encoded by $f$, followed by the densification process in the decoder $g$. After pre-training, only the encoder $f$ is used for downstream tasks.
  • ...and 2 more figures
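
The cross-attention fusion described in the Figure 2 caption can be sketched in a few lines of standard-library Python. This is only an illustration under strong assumptions: it uses single-head scaled dot-product attention with no learned projections, no self-attention, and no residual connections, and the mean pooling and concatenation used to form a global descriptor are placeholders rather than the pooling used in UMF. All names and token values are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Replace each query token with a softmax-weighted average of the other modality's tokens."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product similarity between this token and every token of the other modality.
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(d) for kv in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[c] for w, kv in zip(weights, keys_values)) for c in range(d)])
    return out

def mean_pool(tokens):
    """Average a list of tokens into one vector (assumed pooling, for illustration only)."""
    n = len(tokens)
    return [sum(t[c] for t in tokens) / n for c in range(len(tokens[0]))]

# Toy token sets with a shared feature dimension of 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
lidar_tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Each modality attends to the other; the pooled results are concatenated.
img_attended = cross_attention(image_tokens, lidar_tokens)
lidar_attended = cross_attention(lidar_tokens, image_tokens)
global_descriptor = mean_pool(img_attended) + mean_pool(lidar_attended)
```

Here the image token `[1.0, 0.0]` places most of its attention weight on the LiDAR token `[2.0, 0.0]`, so its attended output leans toward the first feature dimension. A trained model would apply learned query, key, and value projections before this step.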