Monocular Visual Place Recognition in LiDAR Maps via Cross-Modal State Space Model and Multi-View Matching

Gongxin Yao, Xinyang Li, Luowei Fu, Yu Pan

TL;DR

The paper introduces an efficient framework for learning descriptors for both RGB images and point clouds. It uses a visual state space model (VMamba) as the backbone and a pixel-view-scene joint training strategy for cross-modal contrastive learning.

Abstract

Achieving monocular camera localization within pre-built LiDAR maps can bypass the simultaneous mapping process of visual SLAM systems, potentially reducing the computational overhead of autonomous localization. One of the key challenges is cross-modal place recognition, which involves retrieving 3D scenes (point clouds) from a LiDAR map given online RGB images. In this paper, we introduce an efficient framework to learn descriptors for both RGB images and point clouds. It uses a visual state space model (VMamba) as the backbone and employs a pixel-view-scene joint training strategy for cross-modal contrastive learning. To address the field-of-view differences, independent descriptors are generated for point clouds from multiple evenly distributed viewpoints. A visible 3D points overlap strategy is then designed to quantify the similarity between point cloud views and RGB images for multi-view supervision. Additionally, when generating descriptors from pixel-level features using NetVLAD, we compensate for the loss of geometric information and introduce an efficient scheme for multi-view generation. Experimental results on the KITTI and KITTI-360 datasets demonstrate the effectiveness and generalization of our method. The code will be released upon acceptance.
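
The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that each point cloud in the map stores one descriptor per viewpoint, and that a scene is scored by the best cosine similarity between the image descriptor and any of its view descriptors. The max-over-views scoring and the names `cosine`, `retrieve`, and `map_views` are assumptions made for illustration; the abstract does not specify how view scores are aggregated.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_desc, map_views, top_n=1):
    """Rank map scenes for one RGB image descriptor.

    map_views maps a scene id to a non-empty list of per-view descriptors.
    Assumption: a scene's score is the maximum similarity over its views.
    """
    scored = []
    for scene_id, views in map_views.items():
        best = max(cosine(query_desc, view) for view in views)
        scored.append((best, scene_id))
    # Highest similarity first; keep only the top_n scene ids.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [scene_id for _, scene_id in scored[:top_n]]
```

Scoring against per-view descriptors rather than a single 360° descriptor reflects the field-of-view argument in the abstract: a monocular image only sees a slice of the scene, so it is compared against slices of the point cloud.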

Paper Structure (15 sections, 14 equations, 6 figures, 6 tables)

This paper contains 15 sections, 14 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Illustration of cross-modal place recognition, which enables a robot with a monocular camera to localize within a pre-built LiDAR map. We generate a global descriptor for the online RGB image and independent descriptors from multiple viewpoints for each point cloud. This bridges the modality gap and reduces the FOV differences, improving retrieval accuracy.
  • Figure 2: Illustrations of VMamba and our cross-modal framework. (a) VMamba converts 2D visual tokens to 1D sequences for the state space model using a four-way scanning mechanism. (b) Our framework builds separate VMamba-based pyramids for RGB images and point clouds. Point clouds are converted to 360° range images via spherical projection, offering a unified 2D format. After extracting pixel features from both modalities, the improved NetVLAD (denoted as xNetVLAD) generates a global descriptor for RGB images and efficiently generates multi-view descriptors for 360° range images. We perform pixel-view-scene joint contrastive learning to train the dual-pyramid model.
  • Figure 3: Limitations of vanilla NetVLAD. (a) A toy example demonstrating the loss of geometric information. Colorful shapes represent different semantic objects. The residual summation operator $\mathcal{F}_{v}$ in Eq. \ref{eq3:residual} generates the same feature vector for two scenes with the same semantics but different geometries. (b) Two real-world scenes with similar semantics, both containing road, cars, buildings, and vegetation. The bottom two plots show the features generated by NetVLAD and by convolution. Blue and red represent the two scenes. The x-axis and y-axis represent feature indices and values, respectively.
  • Figure 4: The workflow to quantify the similarity between RGB image $\boldsymbol{I}_{t_1}$ and the multiple views of point cloud $\boldsymbol{P}_{t_2}$. We use it solely for training.
  • Figure 5: Recall@N results on KITTI, with N ranging from 0 to 50.
  • ...and 1 more figures
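
The spherical projection mentioned in the Figure 2 caption, which turns a point cloud into a 360° range image, can be sketched as follows. This is a common formulation rather than the authors' implementation: the image size and the vertical field of view (+3° to -25°, typical of a 64-beam LiDAR) are assumed values, and the function name `spherical_projection` is hypothetical.

```python
import math


def spherical_projection(points, height=64, width=1024,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (x, y, z) points to a height x width range image.

    Each pixel stores the range of the nearest point that falls into it;
    0.0 marks an empty pixel. FOV and image size are assumed values.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # skip points at the sensor origin
        yaw = math.atan2(y, x)      # horizontal angle in [-pi, pi]
        pitch = math.asin(z / r)    # vertical angle
        if pitch < fov_down or pitch > fov_up:
            continue  # outside the assumed vertical field of view
        # Yaw maps to columns (wrapping around 360 degrees), pitch to rows.
        col = int(0.5 * (1.0 - yaw / math.pi) * width) % width
        row = int((fov_up - pitch) / fov * height)
        row = min(height - 1, max(0, row))
        # Keep the closest return when several points share a pixel.
        if image[row][col] == 0.0 or r < image[row][col]:
            image[row][col] = r
    return image
```

Because the columns cover the full 360° of yaw, the resulting image gives the point cloud the same 2D grid format as an RGB image, which is what lets both modalities share a 2D backbone.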