A Consistency-Aware Spot-Guided Transformer for Versatile and Hierarchical Point Cloud Registration

Renlang Huang, Yufan Tang, Jiming Chen, Liang Li

TL;DR

This work addresses robust, real-time point cloud registration without pose priors by introducing CAST, a consistency-aware spot-guided Transformer. CAST enforces geometric consistency during coarse matching through spot-guided cross-attention and consistency-aware self-attention, and couples it with a lightweight sparse-to-dense fine matching module for efficient, accurate pose estimation. It also employs a compatibility-graph embedding to filter outliers without relying on heavy hypothesis-and-selection pipelines. Extensive experiments across outdoor LiDAR and indoor RGB-D benchmarks demonstrate state-of-the-art accuracy, robustness, and efficiency, with strong performance in real-time odometry scenarios.

Abstract

Deep learning-based feature matching has shown great superiority for point cloud registration in the absence of pose priors. Although coarse-to-fine matching approaches are prevalent, the coarse matching of existing methods is typically sparse and loose without consideration of geometric consistency, which makes the subsequent fine matching rely on ineffective optimal transport and hypothesis-and-selection methods for consistency. Therefore, these methods are neither efficient nor scalable for real-time applications such as odometry in robotics. To address these issues, we design a consistency-aware spot-guided Transformer (CAST), which incorporates a spot-guided cross-attention module to avoid interference from irrelevant areas, and a consistency-aware self-attention module to enhance matching capabilities with geometrically consistent correspondences. Furthermore, a lightweight fine matching module for both sparse keypoints and dense features can estimate the transformation accurately. Extensive experiments on both outdoor LiDAR point cloud datasets and indoor RGB-D point cloud datasets demonstrate that our method achieves state-of-the-art accuracy, efficiency, and robustness.
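To make the notion of geometrically consistent correspondences concrete, the following is a minimal sketch of a pairwise compatibility check. It is not CAST's learned module. It assumes the common length-preservation criterion: under a rigid transformation, the distance between two source points equals the distance between their matched target points. The function names, the tolerance `tol`, and the toy points are all hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def compatibility_matrix(src_pts, tgt_pts, tol=0.1):
    """Build an n x n boolean matrix over correspondences (src_pts[i], tgt_pts[i]).

    Entry [i][j] is True when correspondences i and j preserve their
    pairwise distance within `tol` (an assumed, hand-picked tolerance).
    The diagonal is left False.
    """
    n = len(src_pts)
    compat = [[False] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d_src = math.dist(src_pts[i], src_pts[j])
        d_tgt = math.dist(tgt_pts[i], tgt_pts[j])
        compat[i][j] = compat[j][i] = abs(d_src - d_tgt) <= tol
    return compat


def consistency_scores(compat):
    """Count, for each correspondence, how many others it is compatible with."""
    return [sum(row) for row in compat]


# Toy data: the first three matches follow a pure translation by (5, 0, 0);
# the fourth match is a deliberate outlier.
src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
tgt = [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0), (5.0, 1.0, 0.0), (9.0, 9.0, 9.0)]

scores = consistency_scores(compatibility_matrix(src, tgt))
print(scores)  # the outlier is compatible with no other correspondence
```

In this toy case the three inlier matches support one another while the outlier receives no support, so a simple threshold on the score would already reject it. A learned approach can use the same pairwise structure as input rather than a hard threshold.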

Paper Structure

This paper contains 43 sections, 21 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Overview of CAST. The feature pyramid network down-samples the point clouds and learns features at multiple resolutions. The coarse matching module extracts consistency-aware semi-dense correspondences via a group of alternating consistency-aware self-attention modules and spot-guided cross-attention modules with multi-scale feature fusion. Finally, the fine matching module predicts correspondences for both sparse keypoints and dense features and estimates the transformation.
  • Figure 2: Illustration of consistency-aware self-attention and spot-guided cross-attention (Left), as well as visualization of the global cross-attention and spot-guided cross-attention (Right). In the left part, the green nodes are query nodes, the red nodes with correct correspondences (green dotted lines) are reliable neighbors, and the blue node with a false correspondence (red dotted line) is an unreliable neighbor. The self-attention (black lines) only attends to salient nodes, while the cross-attention (black lines) only attends to spots (nodes within black circles).
  • Figure 3: Qualitative registration results on the KITTI dataset. We show three examples in three columns. The first two rows present the raw point clouds and highlight the 3D keypoints with low uncertainty in red. Our keypoints are typically located at sharp corners and edges of buildings, pillars, and vehicles. The third row shows the predicted sparse keypoint correspondences with high scores, while the last row presents the aligned point clouds after pose estimation. Although a few outliers colored in red have not been filtered out, their distances are acceptable for accurate registration.
  • Figure 4: The detailed architecture of the KPConv-based feature pyramid network.
  • Figure 5: The detailed architecture of the consistency-aware spot-guided Transformer with multi-scale feature fusion for coarse feature matching.
  • ...and 3 more figures