LCPR: A Multi-Scale Attention-Based LiDAR-Camera Fusion Network for Place Recognition

Zijie Zhou, Jingyi Xu, Guangming Xiong, Junyi Ma

TL;DR

LCPR tackles place recognition in GPS-denied settings by fusing LiDAR range images with multi-view RGB imagery to produce yaw-rotation invariant, discriminative global descriptors. It introduces a Vertically Compressed Transformer Fusion module that fuses features across scales and modalities, complemented by residual encoders and NetVLAD-MLP aggregations to generate compact descriptors. The approach achieves state-of-the-art performance on nuScenes, demonstrates robustness to occlusion and lighting changes, and preserves real-time inference capabilities suitable for in-vehicle deployment. The work advances multimodal place recognition by exploiting panoramic views and cross-modal attention, with practical implications for reliable loop closure and global localization in autonomous driving.

Abstract

Place recognition is one of the most crucial modules for autonomous vehicles to identify places that were previously visited in GPS-invalid environments. Sensor fusion is considered an effective method to overcome the weaknesses of individual sensors. In recent years, multimodal place recognition fusing information from multiple sensors has gathered increasing attention. However, most existing multimodal place recognition methods only use limited field-of-view camera images, which leads to an imbalance between features from different modalities and limits the effectiveness of sensor fusion. In this paper, we present a novel neural network named LCPR for robust multimodal place recognition, which fuses LiDAR point clouds with multi-view RGB images to generate discriminative and yaw-rotation invariant representations of the environment. A multi-scale attention-based fusion module is proposed to fully exploit the panoramic views from different modalities of the environment and their correlations. We evaluate our method on the nuScenes dataset, and the experimental results show that our method can effectively utilize multi-view camera and LiDAR data to improve the place recognition performance while maintaining strong robustness to viewpoint changes. Our open-source code and pre-trained models are available at https://github.com/ZhouZijie77/LCPR .
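Recognizing a previously visited place from such a representation is usually posed as descriptor retrieval, and the Figure 1 caption describes localization as a nearest-neighbor search in a database. The following is a minimal sketch of that retrieval step in plain Python, not the authors' implementation. The function names, the L2 normalization, and the Euclidean distance metric are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_global_descriptor(sub_descriptors):
    """Concatenate sub-descriptors into one vector, then L2-normalize it.

    The normalization is an assumption of this sketch, chosen so that
    Euclidean distances between descriptors are comparable.
    """
    flat = [x for sub in sub_descriptors for x in sub]
    return l2_normalize(flat)


def nearest_place(query, database):
    """Return (index, distance) of the database descriptor closest to the query."""
    best_idx, best_dist = -1, math.inf
    for idx, ref in enumerate(database):
        dist = math.dist(query, ref)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist


# Toy example: two reference places, each described by two 2-D sub-descriptors.
database = [
    build_global_descriptor([[1.0, 0.0], [0.0, 0.0]]),
    build_global_descriptor([[0.0, 0.0], [0.0, 1.0]]),
]
query = build_global_descriptor([[0.9, 0.1], [0.0, 0.0]])
match_index, match_distance = nearest_place(query, database)
```

A linear scan is enough for a toy database; a larger map would typically use an approximate nearest-neighbor index instead.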

Paper Structure (18 sections, 8 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Multi-view images can provide full-perspective information about the environment, as LiDAR does. LCPR takes multi-view RGB images and a LiDAR range image as inputs and uses transformer attention to identify correspondences between the two modalities. Localization is achieved by searching for the nearest neighbor in the database.
  • Figure 2: The overall architecture of LCPR. Multi-view RGB images and one range image are fed into the sibling image encoding (IE) and LiDAR encoding (LE) branches to obtain intermediate features at different resolutions. A set of Vertically Compressed Transformer Fusion (VCTF) modules fuses these intermediate features at multiple scales. The outputs of the IE and LE branches then pass through Vertical Compression (VC) layers to obtain denser panoramic features, which are aggregated by NetVLAD-MLP combinations to generate sub-descriptors. Finally, the global multimodal descriptor is formed by concatenating the sub-descriptors and is used as a query or as a reference in the database.
  • Figure 3: The architecture of the Vertically Compressed Transformer Fusion (VCTF) module. The intermediate feature volumes from the two sibling feature streams are first compressed in the vertical direction. The compressed, sentence-like features are concatenated horizontally and fed into the multi-head self-attention (MHSA) module for multimodal fusion. The fused feature is then split and expanded before being sent back to the original feature streams.
  • Figure 4: Place recognition results on the SHV split.
  • Figure 5: Place recognition results on the BS split when manually adjusting image brightness.
  • ...and 1 more figures
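
The four VCTF steps named in the Figure 3 caption (vertical compression, horizontal concatenation, MHSA, split and expand) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: max pooling over height as the compression operator, random untrained projection weights, residual addition when sending features back, and all function names are hypothetical. The image feature is assumed to be already stitched into a single panoramic volume.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, num_heads, rng):
    """Multi-head self-attention over (N, C) tokens with random, untrained weights."""
    n, c = tokens.shape
    d = c // num_heads
    wq, wk, wv, wo = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(4))
    # Project, then split channels into heads: (heads, N, head_dim).
    q = (tokens @ wq).reshape(n, num_heads, d).transpose(1, 0, 2)
    k = (tokens @ wk).reshape(n, num_heads, d).transpose(1, 0, 2)
    v = (tokens @ wv).reshape(n, num_heads, d).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (heads, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, c)       # merge heads
    return out @ wo


def vctf_sketch(img_feat, lidar_feat, num_heads=4, seed=0):
    """Fuse two feature volumes of shape (C, H, W); H and W may differ per modality."""
    rng = np.random.default_rng(seed)
    w_img = img_feat.shape[2]
    # 1. Vertical compression: collapse the height axis (max pooling is an assumption).
    img_seq = img_feat.max(axis=1).T      # (W_img, C), one token per column
    lidar_seq = lidar_feat.max(axis=1).T  # (W_lidar, C)
    # 2. Horizontal concatenation into one token sequence.
    tokens = np.concatenate([img_seq, lidar_seq], axis=0)
    # 3. Multi-head self-attention across both modalities.
    fused = self_attention(tokens, num_heads, rng)
    # 4. Split per modality, expand along height, and add back (residual is an assumption).
    img_fused, lidar_fused = fused[:w_img], fused[w_img:]
    img_out = img_feat + img_fused.T[:, None, :]
    lidar_out = lidar_feat + lidar_fused.T[:, None, :]
    return img_out, lidar_out


# Toy feature volumes: 8 channels, different heights and widths per modality.
data_rng = np.random.default_rng(1)
img = data_rng.standard_normal((8, 4, 12))
lidar = data_rng.standard_normal((8, 6, 10))
img_out, lidar_out = vctf_sketch(img, lidar)
```

Compressing the height axis before attention shortens the token sequence from H×W to W per modality, which keeps the attention matrix small. Each output keeps the shape of its input, so the fused features can re-enter the original streams.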