
GSPR: Multimodal Place Recognition Using 3D Gaussian Splatting for Autonomous Driving

Zhangshuo Qi, Junyi Ma, Jingyi Xu, Zijie Zhou, Luqi Cheng, Guangming Xiong

TL;DR

GSPR targets robust place recognition for GPS-denied driving by explicitly fusing multi-view RGB images and LiDAR point clouds into a 3D Gaussian Splatting (3D-GS) scene. Multimodal Gaussian Splatting (MGS) builds a spatio-temporally unified Gaussian scene, and a Global Descriptor Generator (GDG) built on 3D graph convolution and a transformer extracts discriminative descriptors from it. Training is two-stage: L1 and SSIM losses supervise scene reconstruction, and a lazy triplet loss supervises the descriptors. Results on nuScenes, KITTI, and KITTI-360 show state-of-the-art performance and strong generalization, and a lighter GSPR-L variant offers a speed-accuracy trade-off. The work indicates that explicit, interpretable cross-modal fusion through Gaussian representations can outperform traditional descriptor-level fusion in challenging outdoor autonomous-driving scenarios.
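
The lazy triplet loss named in the summary can be illustrated with a minimal sketch. It follows the common formulation in which only the hardest negative contributes to the hinge. The function name `lazy_triplet_loss`, the margin of 0.5, and the plain Euclidean distance are assumptions made for illustration; the summary does not state which margin or distance GSPR uses.

```python
import numpy as np

def lazy_triplet_loss(query, positive, negatives, margin=0.5):
    """Hinge loss on the hardest negative only (illustrative sketch).

    query, positive: descriptors of shape (D,)
    negatives: descriptors of shape (N, D)
    margin: assumed value, not taken from the paper
    """
    query = np.asarray(query, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negatives = np.asarray(negatives, dtype=float)

    # Distance from the query to its positive sample (Euclidean, assumed).
    d_pos = np.linalg.norm(query - positive)
    # Distance from the query to every negative sample.
    d_neg = np.linalg.norm(negatives - query, axis=1)

    # Per-negative hinge terms; the "lazy" variant keeps only the maximum,
    # i.e. the negative that violates the margin the most.
    hinge = np.maximum(margin + d_pos - d_neg, 0.0)
    return float(hinge.max())

loss = lazy_triplet_loss(
    query=[0.0, 0.0],
    positive=[1.0, 0.0],
    negatives=[[3.0, 0.0], [1.2, 0.0]],
)
```

In this toy call the far negative already satisfies the margin, so only the near negative at distance 1.2 contributes to the loss.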

Abstract

Place recognition is a crucial component that enables autonomous vehicles to obtain localization results in GPS-denied environments. In recent years, multimodal place recognition methods have gained increasing attention. They overcome the weaknesses of unimodal sensor systems by leveraging complementary information from different modalities. However, most existing methods explore cross-modality correlations through feature-level or descriptor-level fusion, suffering from a lack of interpretability. Conversely, the recently proposed 3D Gaussian Splatting provides a new perspective on multimodal fusion by harmonizing different modalities into an explicit scene representation. In this paper, we propose a 3D Gaussian Splatting-based multimodal place recognition network dubbed GSPR. It explicitly combines multi-view RGB images and LiDAR point clouds into a spatio-temporally unified scene representation with the proposed Multimodal Gaussian Splatting. A network composed of 3D graph convolution and transformer is designed to extract spatio-temporal features and global descriptors from the Gaussian scenes for place recognition. Extensive evaluations on three datasets demonstrate that our method can effectively leverage complementary strengths of both multi-view cameras and LiDAR, achieving SOTA place recognition performance while maintaining solid generalization ability. Our open-source code will be released at https://github.com/QiZS-BIT/GSPR.

Paper Structure (15 sections, 9 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Effectively integrating different modalities is crucial for leveraging multimodal data. GSPR harmonizes multi-view RGB images and LiDAR point clouds into a unified scene representation based on Multimodal Gaussian Splatting. 3D graph convolution and transformer are utilized to extract both local and global spatio-temporal information embedded in the scene, ultimately generating discriminative descriptors.
  • Figure 2: The overall architecture of GSPR. Multimodal Gaussian Splatting employs strategies including LiDAR-based Gaussian initialization and a mixed masking mechanism to fuse LiDAR and camera data into a spatio-temporally unified MGS scene representation. The Global Descriptor Generator voxelizes the MGS scene representation and employs 3D graph convolution and a transformer to extract high-level local and global spatio-temporal features embedded within the scene. Finally, the high-level spatio-temporal features are aggregated into place recognition descriptors using NetVLAD-MLPs combos.
  • Figure 3: The Multimodal Gaussian Splatting (MGS) pipeline initializes the Gaussians using processed LiDAR point clouds as prior information. RGB image sequences generate masks to guide Gaussian optimization through semantic segmentation and mixed masking. After iterative optimization, the multimodal data are integrated into a unified MGS scene representation.
  • Figure 4: A comparison of the rendering results between our MGS and the vanilla 3D-GS. The environmental features of lesser significance for place recognition are masked, while the integration of LiDAR prior enhances the geometric accuracy of explicit scene reconstruction.
  • Figure 5: The detailed architecture of the transformer module. Feature coordinates are explicitly encoded as positional embeddings and fused with features through graph convolutions. Transformer attention is used to extract global context from the features.
  • ...and 1 more figures
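
The Figure 2 caption says the Global Descriptor Generator voxelizes the MGS scene before feature extraction. A minimal sketch of one way to voxelize a set of Gaussians is shown here. The function name `voxelize_centers`, the voxel size, and the choice of mean pooling per voxel are all assumptions for illustration; the captions do not say how GSPR performs this step.

```python
import numpy as np

def voxelize_centers(centers, features, voxel_size=1.0):
    """Pool per-Gaussian features into voxels by averaging (illustrative).

    centers: (N, 3) Gaussian centers
    features: (N, C) per-Gaussian attributes (e.g. color, opacity)
    Returns the integer voxel indices (M, 3) and pooled features (M, C).
    """
    centers = np.asarray(centers, dtype=float)
    features = np.asarray(features, dtype=float)

    # Integer voxel index of every Gaussian center.
    idx = np.floor(centers / voxel_size).astype(np.int64)
    # Unique occupied voxels, plus a map from each Gaussian to its voxel.
    voxels, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions

    # Sum the features per voxel, then divide by the occupancy count.
    pooled = np.zeros((len(voxels), features.shape[1]))
    np.add.at(pooled, inverse, features)
    counts = np.bincount(inverse, minlength=len(voxels)).astype(float)
    return voxels, pooled / counts[:, None]

voxels, pooled = voxelize_centers(
    centers=[[0.1, 0.1, 0.1], [0.9, 0.2, 0.3], [1.5, 0.0, 0.0]],
    features=[[1.0], [3.0], [10.0]],
    voxel_size=1.0,
)
```

The first two centers fall into the same unit voxel and are averaged, while the third occupies a voxel of its own.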