CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration

Dongxu Zhang, Yingsen Wang, Yiding Sun, Haoran Xu, Peilin Fan, Jihua Zhu

Abstract

Robust point cloud registration is a fundamental task in 3D computer vision and geometric deep learning, essential for applications such as large-scale 3D reconstruction, augmented reality, and scene understanding. However, the performance of established learning-based methods often degrades in complex real-world scenarios characterized by incomplete data, sensor noise, and low-overlap regions. To address these limitations, we propose CMHANet, a novel Cross-Modal Hybrid Attention Network. Our method fuses rich contextual information from 2D images with the geometric detail of 3D point clouds, yielding a comprehensive and resilient feature representation. Furthermore, we introduce an optimization function based on contrastive learning, which enforces geometric consistency and significantly improves the model's robustness to noise and partial observations. We evaluate CMHANet on the 3DMatch dataset and the more challenging 3DLoMatch dataset. Additionally, zero-shot evaluations on the TUM RGB-D SLAM dataset verify the model's ability to generalize to unseen domains. The experimental results demonstrate that our method achieves substantial improvements in both registration accuracy and overall robustness, outperforming current techniques. Our code is available at https://github.com/DongXu-Zhang/CMHANet.

Paper Structure (19 sections, 17 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Pixel-to-point correspondences between the point cloud and the image are established through extrinsic parameter calibration, which justifies combining the global texture features of the image with the local geometric characteristics of the point cloud.
  • Figure 2: The Registration Recall (RR) is plotted on the x-axis for 3DMatch and on the y-axis for 3DLoMatch. CMHANet stands out by consistently achieving the highest RR.
  • Figure 3: The workflow overview of CMHANet. Our method processes raw point clouds and images through feature extraction, employs a hybrid-attention mechanism for superpoint matching, refines to dense point correspondences, and finally computes the rigid transformation for alignment.
  • Figure 4: Details of the superpoint matching module. This module employs a multi-stage attention mechanism: Self-Attention refines individual modality features, followed by Aggregation-Attention for structured intra-modal feature integration, and finally Cross-Attention to fuse features across modalities, ultimately generating a similarity matrix for robust superpoint matching. For visualization, the blue blocks represent the source point cloud features ($\hat{F}^P$), the orange blocks represent the target point cloud features ($\hat{F}^Q$), and the yellow lines indicate the attention interactions.
  • Figure 5: Computation graph of the Hybrid Attention mechanism. In the diagrams, the label $F$ denotes feature representations (e.g., $\hat{F}^{PC}$ for generic point cloud features, $\hat{F}^P$ and $\hat{F}^Q$ for source and target point cloud features (in a, c), $\hat{F}^I$ for image features), and $E$ represents geometric positional embeddings. The three modules serve distinct roles. (a) Geometric Self-Attention captures global structural relationships within a single point cloud; (b) Geometric Aggregation-Attention fuses 2D visual context from images into 3D geometric features; and (c) Geometric Cross-Attention establishes consistency and searches for correspondences between the source and target point clouds.
  • ...and 2 more figures
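
The final stage named in the Figure 3 caption, computing a rigid transformation from point correspondences, can be illustrated with a minimal sketch. The captions do not say which solver CMHANet uses, so the code below swaps in the standard weighted SVD (Kabsch) solution as an assumption. The function name `estimate_rigid_transform`, the `weights` argument, and the synthetic data are hypothetical and are not taken from the released code.

```python
import numpy as np


def estimate_rigid_transform(src, dst, weights=None):
    """Least-squares rigid transform (R, t) with R @ src[i] + t ~= dst[i].

    Hypothetical helper for illustration, not the authors' implementation.
    src, dst: (N, 3) arrays of corresponding points.
    weights:  optional (N,) non-negative correspondence confidences.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if weights is None:
        weights = np.ones(len(src))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    # Weighted centroids of both point sets.
    src_centroid = w @ src
    dst_centroid = w @ dst

    # Weighted cross-covariance of the centered points.
    src_c = src - src_centroid
    dst_c = dst - dst_centroid
    H = (src_c * w[:, None]).T @ dst_c

    # Rotation from the SVD, with a sign fix that rules out reflections.
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Translation that maps the source centroid onto the target centroid.
    t = dst_centroid - R @ src_centroid
    return R, t


# Synthetic check: recover a known rotation about z and a known translation.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
angle = np.pi / 6
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.5, -0.2, 1.0])
dst = src @ R_true.T + t_true

R_est, t_est = estimate_rigid_transform(src, dst)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

In a registration pipeline, the correspondences fed to a solver like this would come from the dense point matching stage, and the weights would typically be matching confidences. Real correspondences contain outliers, so such a solver is usually wrapped in a robust estimation scheme rather than applied once to all matches.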