SaliencyI2PLoc: saliency-guided image-point cloud localization using contrastive learning

Yuhao Li, Jianping Li, Zhen Dong, Yuan Wang, Bisheng Yang

TL;DR

SaliencyI2PLoc tackles cross-modality image-to-point-cloud global localization by fusing saliency-guided local features within a dual-transformer framework and optimizing with a contrastive learning objective augmented by multi-manifold feature relation consistency. The method avoids heavy data mining, directly maps image and point-cloud features into a shared space, and uses saliency-aware NetVLAD for robust global descriptors. Empirical results on KITTI-360/KITTI demonstrate substantial improvements in urban localization (Recall@1 up to 78.92% and Recall@20 up to 97.59%), with strong generalization to unseen data and clear ablation-backed gains from saliency fusion and multi-manifold constraints. The work advances cross-modality localization in GNSS-denied environments and lays groundwork for more scalable, multi-robot map fusion in urban settings.

Abstract

Image-to-point-cloud global localization is crucial for robot navigation in GNSS-denied environments and has become increasingly important for multi-robot map fusion and urban asset management. The modality gap between images and point clouds poses significant challenges for cross-modality fusion. Current cross-modality global localization solutions either require modality unification, which leads to information loss, or rely on engineered training schemes to encode multi-modality features, which often lack feature alignment and relation consistency. To address these limitations, we propose SaliencyI2PLoc, a novel contrastive-learning-based architecture that fuses the saliency map into feature aggregation and maintains feature relation consistency across multi-manifold spaces. To avoid the data-mining pre-processing step, we adopt a contrastive learning framework, which efficiently achieves cross-modality feature mapping. We design a context saliency-guided local feature aggregation module that fully leverages the stationary information in the scene to generate a more representative global feature. Furthermore, to enhance cross-modality feature alignment during contrastive learning, we also take into account the consistency of relative relationships between samples in different manifold spaces. Experiments conducted on urban and highway scenario datasets demonstrate the effectiveness and robustness of our method. Specifically, our method achieves a Recall@1 of 78.92% and a Recall@20 of 97.59% on the urban scenario evaluation dataset, improvements of 37.35% and 18.07%, respectively, over the baseline method. This demonstrates that our architecture efficiently fuses images and point clouds and represents a significant step forward in cross-modality global localization. The project page and code will be released.
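To make the contrastive learning idea concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired image and point-cloud global descriptors. It is an illustration under stated assumptions, not the paper's loss: the abstract does not give the exact formulation, the multi-manifold relation consistency term is omitted, and the function names and the temperature value of 0.07 are hypothetical. The other samples in the batch serve as negatives, which is one common way such a framework sidesteps explicit negative mining.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(image_feats, cloud_feats, temperature=0.07):
    """Symmetric InfoNCE loss; image_feats[i] and cloud_feats[i] form a positive pair.

    The temperature default is an assumption, not a value taken from the paper.
    """
    images = [l2_normalize(v) for v in image_feats]
    clouds = [l2_normalize(v) for v in cloud_feats]
    # logits[i][j]: temperature-scaled cosine similarity of image i and point cloud j
    logits = [
        [sum(a * b for a, b in zip(img, pc)) / temperature for pc in clouds]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]  # point cloud -> image direction
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits_t))


if __name__ == "__main__":
    aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

The loss is small when each image descriptor is most similar to its own point-cloud descriptor and large when the pairs are mismatched, which is the behavior the toy example at the bottom checks.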

Paper Structure

This paper contains 29 sections, 15 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Overview of general cross-modality global localization. Given a point cloud map and a query image, the cross-modality localization task aims to retrieve the closest or most similar pre-built point cloud submaps.
  • Figure 2: The architecture of SaliencyI2PLoc. SaliencyI2PLoc encodes the input image-point cloud pairs into a high-dimensional feature embedding space using a feature encoder (ViT for images, mini-PointNet combined with Transformer for point clouds) and feature aggregator (saliency map boosted NetVLAD layer). It then achieves feature fusion and alignment through the contrastive learning loss function that incorporates cross-modal feature relationship consistency constraints.
  • Figure 3: Visualization of the farthest point sampling (FPS) and grouping process, and the architecture of the 3D tokenizer. Numbers in brackets are layer sizes for the multi-layer perceptron (MLP). Batch normalization with ReLU is used for all layers. The 3D tokenizer takes in local patch point clouds and returns a $D_{3d}$-dimensional feature.
  • Figure 4: Illustration of image patchifying and the salient attention map. (a) The patches of the input images, where invalid areas that provide less information are highlighted by red ellipses. (b) The activation maps from the last Transformer blocks of the visual encoder are back-projected onto the input images, where lighter colors reflect higher attention scores.
  • Figure 5: The pipeline of the saliency-guided NetVLAD layer. The red arrow indicates the position where the saliency score is applied, while the purple blocks form the vanilla NetVLAD layer.
  • ...and 11 more figures
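
The farthest point sampling (FPS) and grouping step named in the Figure 3 caption can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the caption does not specify the grouping rule, so k-nearest-neighbor grouping is assumed here, and the function names, the fixed start index, and the toy points are hypothetical.

```python
import math


def farthest_point_sampling(points, num_centers, start_index=0):
    """Greedily pick indices of num_centers points that are mutually far apart."""
    if not 0 < num_centers <= len(points):
        raise ValueError("num_centers must be between 1 and len(points)")
    selected = [start_index]
    # min_dist[i]: distance from point i to its nearest already-selected center
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_centers:
        # The next center is the point farthest from all centers chosen so far.
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, p in enumerate(points):
            d = math.dist(p, points[next_index])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


def group_nearest(points, center_indices, group_size):
    """For each center, return the indices of its group_size nearest points.

    Assumption: kNN grouping; each group includes the center itself.
    """
    groups = []
    for c in center_indices:
        order = sorted(range(len(points)), key=lambda i: math.dist(points[i], points[c]))
        groups.append(order[:group_size])
    return groups


if __name__ == "__main__":
    cloud = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (10, 1, 0), (0, 10, 0)]
    centers = farthest_point_sampling(cloud, num_centers=3)
    print(centers)
    print(group_nearest(cloud, centers, group_size=2))
```

Each group is a local patch of the point cloud; in the pipeline described by the caption, such patches are what the 3D tokenizer would turn into $D_{3d}$-dimensional features.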