UniMPR: A Unified Framework for Multimodal Place Recognition with Arbitrary Sensor Configurations

Zhangshuo Qi, Jingyi Xu, Luqi Cheng, Shichen Wen, Yiming Ma, Guangming Xiong

TL;DR

UniMPR tackles GPS-denied place recognition by unifying heterogeneous sensor data into a polar BEV space and processing it with a three-branch (camera, LiDAR, radar) plus fusion MoE Transformer architecture. A learnable BEV imputation module and a two-stage training regime on a large unified multimodal dataset enable robust performance under missing modalities and across diverse configurations, achieving state-of-the-art results across seven public datasets and a self-collected one. Key contributions include the polar BEV representation, adaptive label assignment for cross-modality consistency, and strong zero-shot generalization to unseen environments and sensor setups. The approach promises practical impact for robust, flexible localization in autonomous systems with varying hardware and environmental conditions.

Abstract

Place recognition is a critical component of autonomous vehicles and robotics, enabling global localization in GPS-denied environments. Recent advances have spurred significant interest in multimodal place recognition (MPR), which leverages complementary strengths of multiple modalities. Despite its potential, most existing MPR methods still face three key challenges: (1) dynamically adapting to arbitrary modality inputs within a unified framework, (2) maintaining robustness with missing or degraded modalities, and (3) generalizing across diverse sensor configurations and setups. In this paper, we propose UniMPR, a unified framework for multimodal place recognition. Using only one trained model, it can seamlessly adapt to any combination of common perceptual modalities (e.g., camera, LiDAR, radar). To tackle data heterogeneity, we unify all inputs within a polar BEV feature space. Subsequently, the polar BEVs are fed into a multi-branch network to exploit discriminative intra-modal and inter-modal features from any modality combination. To fully exploit the network's generalization capability and robustness, we construct a large-scale training set from multiple datasets and introduce an adaptive label assignment strategy for extensive pre-training. Experiments on seven datasets demonstrate that UniMPR achieves state-of-the-art performance under varying sensor configurations, modality combinations, and environmental conditions. Our code will be released at https://github.com/QiZS-BIT/UniMPR.
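
To make the idea of a polar BEV grid concrete, the sketch below bins 3D points into range rings and azimuth sectors. It is only an illustration of the coordinate convention, not the paper's learned feature projection: the function name `polar_bev_occupancy`, the grid size, and the 40 m range limit are all assumptions chosen for the example.

```python
import math

def polar_bev_occupancy(points, num_rings=4, num_sectors=8, max_range=40.0):
    """Count (x, y, z) points per cell of a num_rings x num_sectors polar grid.

    All parameter values are illustrative assumptions, not settings from the paper.
    """
    grid = [[0] * num_sectors for _ in range(num_rings)]
    for x, y, _z in points:
        rho = math.hypot(x, y)  # horizontal distance from the sensor
        if rho >= max_range:
            continue  # point lies outside the grid
        theta = math.atan2(y, x) % (2.0 * math.pi)  # azimuth in [0, 2*pi)
        # The min() guards protect against floating-point values landing on the upper edge.
        ring = min(int(rho / max_range * num_rings), num_rings - 1)
        sector = min(int(theta / (2.0 * math.pi) * num_sectors), num_sectors - 1)
        grid[ring][sector] += 1
    return grid

points = [(1.0, 0.5, 0.0), (5.0, 15.0, 0.2), (-35.0, -1.0, 1.0), (100.0, 0.0, 0.0)]
grid = polar_bev_occupancy(points)
```

Any sensor whose measurements can be placed in the vehicle frame could in principle be rasterized onto the same ring and sector layout, which is what makes a shared grid a convenient meeting point for heterogeneous inputs.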

Paper Structure

This paper contains 40 sections, 7 equations, 9 figures, 24 tables.

Figures (9)

  • Figure 1: UniMPR unifies diverse sensor inputs for multimodal place recognition, and provides adaptability across various modality combinations and heterogeneous sensor configurations.
  • Figure 2: Overview of our proposed UniMPR. It first unifies heterogeneous data from different modalities within a polar BEV coordinate system. The resulting polar BEVs are then fed into a multi-branch network. Within this network, the three modality-specific branches extract features from individual modalities, while a dedicated fusion branch learns cross-modal feature interactions. The learnable BEV imputation module supplies feature tokens for any missing modalities, thereby reducing the dependence of multimodal fusion on specific modality combinations.
  • Figure 3: The Gaussian BEV Projection module. Features are lifted into 3D space via projection. We then construct 3D Gaussians from the sampled features and convert them into a polar BEV representation by orthogonal Gaussian splatting.
  • Figure 4: The proposed two-stage training pipeline. We simultaneously achieve adequate exploration of both inter-modal and intra-modal feature correlations, thereby enhancing the model's overall performance and robustness to missing modalities.
  • Figure 5: The proposed adaptive label assignment strategy. This strategy avoids conflicts in sample definitions by designing distinct label assignment rules for modalities with different fields of view.
  • ...and 4 more figures
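
The Figure 2 caption describes an imputation module that supplies feature tokens for missing modalities. A minimal sketch of that general idea follows, assuming one placeholder token sequence per modality that stands in whenever a sensor is absent. The class `BEVImputation`, the helper `fuse`, the token count, and the feature dimension are hypothetical names and values for illustration; in a trained model the placeholders would be learnable parameters rather than the fixed random values used here.

```python
import random

class BEVImputation:
    """Holds one placeholder token sequence per modality (hypothetical layout)."""

    def __init__(self, modalities, num_tokens, dim, seed=0):
        rng = random.Random(seed)
        # Stand-in for trainable parameters: fixed small random values.
        self.tokens = {
            m: [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)]
            for m in modalities
        }

    def fill(self, inputs):
        """Return tokens for every modality, using placeholders for missing ones."""
        filled = {}
        for m, placeholder in self.tokens.items():
            feats = inputs.get(m)
            filled[m] = feats if feats is not None else placeholder
        return filled

def fuse(filled, order):
    """Concatenate token sequences in a fixed modality order for a fusion stage."""
    sequence = []
    for m in order:
        sequence.extend(filled[m])
    return sequence

modalities = ("camera", "lidar", "radar")
imputer = BEVImputation(modalities, num_tokens=4, dim=8)
lidar_tokens = [[1.0] * 8 for _ in range(4)]  # pretend LiDAR branch output
filled = imputer.fill({"lidar": lidar_tokens})  # camera and radar are missing
fused = fuse(filled, modalities)
```

With this arrangement the fusion stage always receives a sequence of the same length and ordering, whichever sensors are actually present.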