Explicit Interaction for Fusion-Based Place Recognition

Jingyi Xu, Junyi Ma, Qi Wu, Zijie Zhou, Yue Wang, Xieyuanli Chen, Ling Pei

TL;DR

This work tackles GPS-denied place recognition by enabling explicit interaction between LiDAR and camera modalities in a fusion framework. It introduces EINet, a dual-branch architecture that uses LiDAR-derived sparse depth supervision to guide camera features and camera-derived appearance to color LiDAR points, all fused through a cross-modal transformer to produce robust global descriptors. A new benchmark, NUSC-PR, based on nuScenes, supports both supervised and self-supervised training with standardized evaluation protocols. Experiments show EINet outperforms state-of-the-art fusion-based methods, demonstrates strong generalization to unseen locations, and maintains online inference efficiency, with open-source code and benchmark released for future research.

Abstract

Fusion-based place recognition is an emerging technique that jointly uses multi-modal perception data to recognize previously visited places in GPS-denied scenarios for robots and autonomous vehicles. Recent fusion-based place recognition methods combine multi-modal features implicitly. While achieving remarkable results, they do not explicitly consider what each individual modality affords in the fusion system, so the benefit of multi-modal feature fusion may not be fully explored. In this paper, we propose a novel fusion-based network, dubbed EINet, to achieve explicit interaction between the two modalities. EINet uses LiDAR ranges to supervise vision features that are more robust over long time spans, and simultaneously uses camera RGB data to improve the discriminability of LiDAR point clouds. In addition, we develop a new benchmark for the place recognition task based on the nuScenes dataset. To establish this benchmark for future research with comprehensive comparisons, we introduce both supervised and self-supervised training schemes alongside evaluation protocols. We conduct extensive experiments on the proposed benchmark, and the results show that EINet achieves better recognition performance and solid generalization ability compared to state-of-the-art fusion-based place recognition approaches. Our open-source code and benchmark are released at: https://github.com/BIT-XJY/EINet.
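
The two directions of explicit interaction described in the abstract both rest on projecting LiDAR points into a camera image. The following is a minimal sketch of that projection, not the authors' implementation: it assumes a pinhole camera model, and the function name `project_lidar_to_camera` and the inputs `T_cam_from_lidar` (LiDAR-to-camera extrinsics) and `K` (intrinsics) are hypothetical names chosen for illustration. It returns a sparse depth map that could serve as a supervision target for camera depth, together with an RGB color for each LiDAR point that lands inside the image.

```python
import numpy as np

def project_lidar_to_camera(points_lidar, T_cam_from_lidar, K, image):
    """Project LiDAR points into one camera image (pinhole model assumed).

    points_lidar:     (N, 3) xyz in the LiDAR frame.
    T_cam_from_lidar: (4, 4) rigid transform, LiDAR frame -> camera frame.
    K:                (3, 3) camera intrinsics.
    image:            (H, W, 3) uint8 RGB image.
    Returns (sparse_depth, colors, valid).
    """
    h, w = image.shape[:2]
    n = points_lidar.shape[0]

    # Move the points into the camera frame using homogeneous coordinates.
    homo = np.hstack([points_lidar, np.ones((n, 1))])
    pts_cam = (T_cam_from_lidar @ homo.T).T[:, :3]
    z = pts_cam[:, 2]
    in_front = z > 1e-6

    # Perspective division; points behind the camera get a dummy divisor
    # and are rejected by the validity mask below.
    uvw = (K @ pts_cam.T).T
    safe_z = np.where(in_front, z, 1.0)
    u = np.floor(uvw[:, 0] / safe_z).astype(int)
    v = np.floor(uvw[:, 1] / safe_z).astype(int)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Sparse depth map: keep the nearest point per pixel, 0 means "no return".
    sparse_depth = np.full((h, w), np.inf, dtype=np.float64)
    np.minimum.at(sparse_depth, (v[valid], u[valid]), z[valid])
    sparse_depth[np.isinf(sparse_depth)] = 0.0

    # Per-point color lookup. Occlusion is not handled here: every point
    # that projects into the image receives the color of its pixel.
    colors = np.zeros((n, 3), dtype=np.uint8)
    colors[valid] = image[v[valid], u[valid]]
    return sparse_depth, colors, valid
```

In a multi-camera rig such as the one used in nuScenes, a function like this would be called once per camera, and a point seen by several cameras would need a tie-breaking rule; that choice is left open in this sketch.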

Paper Structure

This paper contains 14 sections, 5 equations, 4 figures, 4 tables, 2 algorithms.

Figures (4)

  • Figure 1: Unlike existing implicit fusion methods, our proposed method leverages the strengths of the LiDAR modality (robustness from range) and the camera modality (richness from appearance), using explicit interaction to generate place descriptors.
  • Figure 2: Pipeline overview of our proposed EINet. The purple and blue arrows represent the camera branch and the LiDAR branch, respectively. The camera sensors provide RGB information for the rendered point clouds and range images in the LiDAR branch (dotted green arrow), while the LiDAR sensor helps to supervise pseudo depth in the camera branch (dotted orange arrow), achieving explicit interaction in the fusion-based place recognition framework.
  • Figure 3: An example of the raw camera images, pseudo depth maps, raw LiDAR range image, and rendered range image in explicit interaction.
  • Figure 4: Evaluation of place recognition performance in the self-supervised learning scheme on the BS split.
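
The rendered range image mentioned in the captions of Figures 2 and 3 can be pictured as a spherical projection of a colored point cloud. The sketch below illustrates that idea under stated assumptions and is not taken from the paper: the function name `render_range_image`, the image size, and the vertical field-of-view bounds are hypothetical defaults, and the nearest-point-per-pixel rule is one common convention rather than a confirmed detail of EINet.

```python
import numpy as np

def render_range_image(points, colors, height=32, width=1024,
                       fov_up_deg=10.0, fov_down_deg=-30.0):
    """Spherically project colored LiDAR points into a range image.

    points: (N, 3) xyz in the LiDAR frame; colors: (N, 3) uint8 RGB.
    Returns (range_img, rgb_img) with shapes (H, W) and (H, W, 3).
    All size and field-of-view defaults are illustrative assumptions.
    """
    r = np.linalg.norm(points, axis=1)
    keep = r > 1e-6                      # drop points at the sensor origin
    points, colors, r = points[keep], colors[keep], r[keep]

    yaw = np.arctan2(points[:, 1], points[:, 0])             # in [-pi, pi]
    pitch = np.arcsin(np.clip(points[:, 2] / r, -1.0, 1.0))
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)

    # Columns span the full horizontal circle, rows span the vertical FOV.
    col = np.floor(0.5 * (1.0 - yaw / np.pi) * width).astype(int)
    col = np.clip(col, 0, width - 1)
    row = np.floor((fov_up - pitch) / (fov_up - fov_down) * height).astype(int)

    inside = (row >= 0) & (row < height)
    row, col, r, colors = row[inside], col[inside], r[inside], colors[inside]

    # Nearest point wins: sort by range, then keep the first hit per pixel.
    order = np.argsort(r, kind="stable")
    flat = (row * width + col)[order]
    _, first = np.unique(flat, return_index=True)
    pick = order[first]

    range_img = np.zeros((height, width), dtype=np.float64)
    rgb_img = np.zeros((height, width, 3), dtype=np.uint8)
    range_img[row[pick], col[pick]] = r[pick]
    rgb_img[row[pick], col[pick]] = colors[pick]
    return range_img, rgb_img
```

Pixels that receive no LiDAR return stay at zero in both outputs, so a downstream network can tell empty cells apart from valid measurements.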