RegFormer++: An Efficient Large-Scale 3D LiDAR Point Registration Network with Projection-Aware 2D Transformer

Jiuming Liu; Guangming Wang; Zhe Liu; Chaokang Jiang; Haoang Li; Mengmeng Liu; Tianchen Deng; Marc Pollefeys; Michael Ying Yang; Hesheng Wang

Abstract

Although point cloud registration has achieved remarkable advances in object-level and indoor scenes, large-scale LiDAR registration has rarely been explored. The challenges mainly arise from the huge point scale, complex point distribution, and numerous outliers in outdoor LiDAR scans. In addition, most existing registration works adopt a two-stage paradigm: they first find correspondences by extracting discriminative local descriptors and then apply robust estimators (e.g., RANSAC) to filter outliers, a pipeline that depends heavily on well-designed descriptors and post-processing choices. To address these problems, we propose a novel end-to-end differential transformer network, termed RegFormer++, for large-scale point cloud alignment without any further post-processing. Specifically, a hierarchical projection-aware 2D transformer with linear complexity is proposed to project raw LiDAR points onto a cylindrical surface and extract global point features, which improves resilience to outliers by capturing long-range dependencies. Because we fill the original 3D coordinates into the 2D projected positions, the transformer benefits from both the efficiency of 2D processing and the accuracy of 3D geometric information. Furthermore, to effectively reduce incorrect point matches, a Bijective Association Transformer (BAT) is designed, combining cross attention with all-to-all point gathering. To improve training stability and robustness, a feature-transformed optimal transport module is also designed for regressing the final pose transformation. Extensive experiments on the KITTI, NuScenes, and Argoverse datasets demonstrate that our model achieves state-of-the-art performance in both accuracy and efficiency.
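As a rough illustration of the projection step mentioned in the abstract, the following Python sketch maps raw LiDAR points onto a cylindrical grid and stores the original $x,y,z$ coordinates in each occupied cell, together with a validity mask. The function name `cylindrical_projection`, the grid size, the vertical field of view (values typical of a 64-beam sensor), and the rule of keeping the nearest point per cell are all assumptions made for illustration; they are not settings confirmed by the paper.

```python
import math


def cylindrical_projection(points, height=64, width=1800,
                           fov_up_deg=2.0, fov_down_deg=-24.8):
    """Project 3D points onto a height x width cylindrical grid.

    Each occupied cell stores the raw (x, y, z) tuple; empty cells hold None.
    All default parameters are illustrative assumptions.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    grid = [[None] * width for _ in range(height)]
    mask = [[False] * width for _ in range(height)]

    for x, y, z in points:
        planar = math.hypot(x, y)
        if planar == 0.0:
            continue  # azimuth is undefined on the sensor axis
        yaw = math.atan2(y, x)         # horizontal angle in [-pi, pi]
        pitch = math.atan2(z, planar)  # vertical angle
        if not fov_down <= pitch <= fov_up:
            continue  # outside the assumed vertical field of view

        # Column from azimuth, row from elevation (top row = highest beam).
        u = int((yaw + math.pi) / (2.0 * math.pi) * width) % width
        v = min(height - 1,
                int((fov_up - pitch) / (fov_up - fov_down) * height))

        # Assumed tie-break: keep the point closest to the sensor.
        current = grid[v][u]
        if current is None or math.dist((x, y, z), (0.0, 0.0, 0.0)) < \
                math.dist(current, (0.0, 0.0, 0.0)):
            grid[v][u] = (x, y, z)
            mask[v][u] = True

    return grid, mask
```

Because each cell keeps the raw coordinates rather than only a range value, later stages can index neighbors cheaply in 2D while still computing geometry in 3D.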

Paper Structure (26 sections, 22 equations, 12 figures, 12 tables, 1 algorithm)

Introduction
Related Work
Classical Point Cloud Registration
Learning-based Point Cloud Registration
Large-Scale Point Cloud Registration
Transformer in the Registration Task
RegFormer++
Overall Architecture
Cylindrical Projection
Kernel-based Efficient Patch Embedding
Point Swin Transformer
Bijective Association Transformer
Feature-Transformed Optimal Transport
Estimation of the Rigid Transformation
Loss Function
...and 11 more sections

Figures (12)

Figure 1: Comparison with previous point cloud registration works. Previous methods (A) extract local descriptors and establish explicit correspondences, while our RegFormer++ relies on globally-aware features and learns cross-frame association through cross attention, requiring no explicit correspondences. Furthermore, prior literature resorts to the RANSAC algorithm (Fischler and Bolles, 1981) for outlier filtering. In contrast, our method leverages the attention mechanism to softly remove outliers.
Figure 2: The overall architecture of RegFormer++. We first project the point clouds onto a 2D surface and feed the resulting patches into three cascaded feature extraction transformers. For the cross-frame association, we design a Bijective Association Transformer module, which includes a cross attention module and an all-to-all point gathering module. Finally, the transformations are refined iteratively.
Figure 3: Cylindrical projection. We project 3D point clouds onto a 2D surface and fill each pixel with its raw $x,y,z$ coordinates. A projection mask is also proposed to remove invalid positions.
Figure 4: Visualization of kernel-based efficient patch embedding. For each center point, its surrounding neighbor points within a predefined 2D kernel are indexed and grouped. Then, outliers that are harmful to registration are filtered out. The grouping and filtering shown in 2D space actually take place in 3D space, by virtue of the cylindrical projection process.
Figure 5: All-to-all point gathering. After the cross-attention mechanism is leveraged for preliminary information exchange between the two frames, the geometric characteristics of the conditioned features ${\tilde{F}}^{S}_{3}, {\tilde{F}}^{T}_{3}$ are fully considered to generate the initial motion embeddings. The initial embeddings include coordinate information, point and neighbor similarity features, and content features.
...and 7 more figures
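The grouping-then-filtering idea in the Figure 4 caption can be sketched as follows: neighbors are indexed through a fixed 2D kernel on the projected grid, and candidates are then filtered using the 3D coordinates stored in the cells. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `group_neighbors`, the kernel size, the Euclidean distance threshold `max_dist`, and the horizontal wrap-around are all hypothetical choices. The grid format (rows of `(x, y, z)` tuples or `None`) is also assumed.

```python
import math


def group_neighbors(grid, v, u, kernel=(3, 5), max_dist=1.0):
    """Collect the 3D neighbors of the cell (v, u) inside a 2D kernel.

    grid: rows of (x, y, z) tuples, with None marking empty cells.
    kernel: (rows, cols) window size; assumed odd in both dimensions.
    max_dist: assumed 3D distance threshold used to reject outliers.
    """
    center = grid[v][u]
    if center is None:
        return []
    height, width = len(grid), len(grid[0])
    kh, kw = kernel
    neighbors = []
    for dv in range(-(kh // 2), kh // 2 + 1):
        row = v + dv
        if not 0 <= row < height:
            continue  # no vertical wrap-around
        for du in range(-(kw // 2), kw // 2 + 1):
            col = (u + du) % width  # assumed wrap-around along the azimuth
            point = grid[row][col]
            if point is None:
                continue  # empty cell, no valid point here
            # Indexing happens in 2D; the filtering test happens in 3D.
            if math.dist(center, point) <= max_dist:
                neighbors.append(point)
    return neighbors
```

Two cells that are adjacent in the 2D grid can lie far apart in 3D (for example, at an object boundary), which is why the distance test uses the stored coordinates rather than pixel offsets. Note that the center point itself is returned as part of the group.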
