Range and Bird's Eye View Fused Cross-Modal Visual Place Recognition

Jianyi Peng, Fan Lu, Bin Li, Yuan Huang, Sanqing Qu, Guang Chen

TL;DR

This work tackles image-to-point cloud cross-modal Visual Place Recognition (VPR) by introducing a two-stage retrieval framework that first uses global descriptors from range (or RGB) images and then re-ranks with BEV images, all without intermediate feature matching. It introduces a novel similarity label supervision based on points average distance $D_{avg}$ and an adaptive-margin generalized triplet loss, enabling robust learning from limited data. The method integrates four global-descriptor streams (RGB, range, camera BEV, LiDAR BEV) via a two-phase pipeline and a BEV-based re-ranking strategy, achieving state-of-the-art results on KITTI. Overall, the approach effectively bridges RGB-LiDAR modality gaps and offers practical gains for scalable cross-modal VPR with efficient inference.
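
As a rough illustration of the two-stage search described above, the sketch below ranks database entries by cosine similarity of range (or RGB) descriptors, then re-orders only the top candidates using BEV descriptors. The names `retrieve_and_rerank`, `top_k`, and `bev_weight`, as well as the weighted-sum fusion of the two similarity scores, are assumptions made for illustration; the exact rule the authors use to combine the two rankings is not stated in this summary.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each descriptor (last axis) to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def retrieve_and_rerank(q_range, q_bev, db_range, db_bev, top_k=5, bev_weight=0.5):
    """Two-phase global-descriptor search (illustrative, not the paper's exact rule).

    q_range, q_bev:   query descriptors, shape (D,)
    db_range, db_bev: database descriptors, shape (N, D)
    Returns the indices of the top_k candidates after re-ranking.
    """
    q_range, q_bev = l2_normalize(q_range), l2_normalize(q_bev)
    db_range, db_bev = l2_normalize(db_range), l2_normalize(db_bev)

    # Phase 1: initial retrieval with range (or RGB) descriptors.
    sim_range = db_range @ q_range
    candidates = np.argsort(-sim_range)[:top_k]

    # Phase 2: score only the shortlisted candidates with BEV descriptors.
    sim_bev = db_bev[candidates] @ q_bev

    # Assumed fusion: weighted sum of the two cosine similarities.
    fused = (1.0 - bev_weight) * sim_range[candidates] + bev_weight * sim_bev
    return candidates[np.argsort(-fused)]


# Toy example: entry 1 wins the initial retrieval, entry 0 wins after re-ranking.
db_range = np.array([[0.8, 0.6], [1.0, 0.0], [0.0, 1.0]])
db_bev = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
query_range = np.array([1.0, 0.0])
query_bev = np.array([1.0, 0.0])
ranking = retrieve_and_rerank(query_range, query_bev, db_range, db_bev, top_k=2)
print(ranking)
```

Because both phases are plain dot products over global descriptors, the second phase adds only `top_k` extra similarity computations per query, which is in line with the efficiency claim above.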

Abstract

Image-to-point cloud cross-modal Visual Place Recognition (VPR) is a challenging task where the query is an RGB image, and the database samples are LiDAR point clouds. Compared to single-modal VPR, this approach benefits from the widespread availability of RGB cameras and the robustness of point clouds in providing accurate spatial geometry and distance information. However, current methods rely on intermediate modalities that capture either the vertical or horizontal field of view, limiting their ability to fully exploit the complementary information from both sensors. In this work, we propose an innovative initial retrieval + re-rank method that effectively combines information from range (or RGB) images and Bird's Eye View (BEV) images. Our approach relies solely on a computationally efficient global descriptor similarity search process to achieve re-ranking. Additionally, we introduce a novel similarity label supervision technique to maximize the utility of limited training data. Specifically, we employ points average distance to approximate appearance similarity and incorporate an adaptive margin, based on similarity differences, into the vanilla triplet loss. Experimental results on the KITTI dataset demonstrate that our method significantly outperforms state-of-the-art approaches.
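
The adaptive margin mentioned above can be sketched as follows. This is a minimal sketch, assuming that each sample carries a similarity label in [0, 1] relative to the anchor (derived in the paper from the points average distance) and that the margin is proportional to the difference between the two labels. The function name, the `scale` parameter, and the use of Euclidean descriptor distance are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np


def adaptive_margin_triplet_loss(anchor, pos, neg, sim_pos, sim_neg, scale=1.0):
    """Triplet loss whose margin grows with the similarity-label gap.

    anchor, pos, neg: descriptors, shape (B, D)
    sim_pos, sim_neg: similarity labels to the anchor, with sim_pos >= sim_neg
    """
    d_pos = np.linalg.norm(anchor - pos, axis=-1)
    d_neg = np.linalg.norm(anchor - neg, axis=-1)

    # Assumed margin: larger label gap -> the two samples must be pushed
    # further apart in descriptor space. A vanilla triplet loss would use
    # a fixed constant here instead.
    margin = scale * (np.asarray(sim_pos) - np.asarray(sim_neg))

    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())


anchor = np.array([[0.0, 0.0]])
closer = np.array([[1.0, 0.0]])
farther = np.array([[1.5, 0.0]])
loss = adaptive_margin_triplet_loss(anchor, closer, farther, sim_pos=0.9, sim_neg=0.1)
print(loss)
```

Under this assumption, any two samples with different similarity labels can form a training triplet, which is one way to read the claim that the supervision makes fuller use of limited training data.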

Paper Structure

This paper contains 17 sections, 2 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Illustration of our image-to-point cloud cross-modal visual place recognition. It mainly consists of two separate similarity search processes that use only global descriptors. In this way, we can effectively combine the information from range (or RGB) images and Bird's Eye View (BEV) images, significantly reducing the modality gap.
  • Figure 2: The training pipeline to produce the range (or RGB) and BEV descriptors. Raw data (e.g., LiDAR point cloud, camera RGB image) are preprocessed to reduce modality differences and improve the overlap in visual content. Feature maps are generated by extracting features from RGB, LiDAR range, camera BEV and LiDAR BEV, which are later aggregated by Generalized Mean (GeM) pooling to obtain global descriptors. It is worth noting that we use the points average distance together with a generalized triplet loss to supervise the learning process and fully utilize the limited training data.
  • Figure 3: Examples of three submaps with similarity values in decreasing order. The RGB image and LiDAR point cloud are cropped to maximize the overlap in visual content. In the third column, red lines connecting corresponding blue and green points indicate the point distances, which are later processed to obtain the final similarity value.
  • Figure 4: The initial retrieval + re-rank pipeline. In the two-phase similarity search, global descriptors with higher similarity, indicated by closer similarity in color, are ranked higher. By combining the rankings from both phases, we improve the precision of retrieval. This results in true positive samples being ranked higher (indicated by global descriptors in green boxes) and false positive samples being ranked lower (indicated by global descriptors in red boxes).
  • Figure 5: Examples of initial retrieval and re-rank results on the KITTI dataset.
  • ...and 2 more figures
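
The caption of Figure 2 mentions Generalized Mean (GeM) pooling as the step that turns a feature map into a global descriptor. A minimal NumPy sketch of standard GeM pooling is given below. The exponent `p=3.0` and the clamp value `eps` are common defaults assumed here, not values taken from the paper, and in practice `p` is often a learnable parameter.

```python
import numpy as np


def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized Mean pooling over the spatial dimensions.

    feature_map: array of shape (C, H, W)
    Returns a global descriptor of shape (C,).
    p = 1 gives average pooling; large p approaches max pooling.
    """
    # Clamp to a small positive value so the fractional power is well defined.
    clamped = np.clip(feature_map, eps, None)
    return np.mean(clamped ** p, axis=(1, 2)) ** (1.0 / p)


feature_map = np.arange(1.0, 25.0).reshape(2, 3, 4)
descriptor = gem_pool(feature_map)
print(descriptor.shape)
```

For each channel the result lies between the spatial mean and the spatial maximum, so strongly activated regions contribute more to the descriptor than they would under plain average pooling.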