Rotation-Invariant Transformer for Point Cloud Matching

Hao Yu, Zheng Qin, Ji Hou, Mahdi Saleh, Dongsheng Li, Benjamin Busam, Slobodan Ilic

TL;DR

This work introduces RoITr, a Rotation-Invariant Transformer that copes with pose variations in point cloud matching. It also proposes a global transformer whose rotation-invariant, cross-frame spatial awareness is learned through self-attention, which significantly improves feature distinctiveness and makes the model robust to low overlap.

Abstract

Intrinsic rotation invariance lies at the core of matching point clouds with handcrafted descriptors. However, it is largely neglected by recent deep matchers, which obtain rotation invariance extrinsically via data augmentation. As a finite number of augmented rotations can never span the continuous SO(3) space, these methods are usually unstable when facing rarely seen rotations. To this end, we introduce RoITr, a Rotation-Invariant Transformer that copes with pose variations in the point cloud matching task. We contribute at both the local and global levels. At the local level, we introduce an attention mechanism embedded with Point Pair Feature (PPF)-based coordinates to describe pose-invariant geometry, upon which a novel attention-based encoder-decoder architecture is constructed. We further propose a global transformer with rotation-invariant cross-frame spatial awareness learned by the self-attention mechanism, which significantly improves feature distinctiveness and makes the model robust to low overlap. Experiments are conducted on both rigid and non-rigid public benchmarks, where RoITr outperforms all state-of-the-art models by a considerable margin in low-overlap scenarios. In particular, when the rotations are enlarged on the challenging 3DLoMatch benchmark, RoITr surpasses existing methods by at least 13 and 5 percentage points in terms of Inlier Ratio and Registration Recall, respectively.
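
To illustrate why PPF-based coordinates are pose-invariant, the sketch below computes the classic four-dimensional Point Pair Feature for two oriented points (the distance between them plus three angles) and checks that it does not change under a rigid transform. This is a minimal illustration, not the RoITr implementation: the helper names (`ppf`, `rotate`), the sample points, and the chosen rotation are all assumptions, and how RoITr embeds these values into its attention is not shown here.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def _angle(a, b):
    """Angle in [0, pi] between two 3D vectors (atan2 form, numerically stable)."""
    return math.atan2(_norm(_cross(a, b)), _dot(a, b))

def ppf(p1, n1, p2, n2):
    """Classic Point Pair Feature: (|d|, angle(n1, d), angle(n2, d), angle(n1, n2))."""
    d = _sub(p2, p1)
    return (_norm(d), _angle(n1, d), _angle(n2, d), _angle(n1, n2))

def rotate(v, axis, theta):
    """Rotate v about `axis` by `theta` radians using Rodrigues' formula."""
    length = _norm(axis)
    k = tuple(x / length for x in axis)
    c, s = math.cos(theta), math.sin(theta)
    kxv = _cross(k, v)
    kdv = _dot(k, v)
    return tuple(v[i] * c + kxv[i] * s + k[i] * kdv * (1.0 - c) for i in range(3))

# Hypothetical oriented points: positions p1, p2 with unit normals n1, n2.
p1, n1 = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
p2, n2 = (1.0, 2.0, 2.0), (0.0, 1.0, 0.0)

# An arbitrary rigid transform: rotation about (1, 1, 1) plus a translation.
axis, theta, t = (1.0, 1.0, 1.0), 1.234, (0.5, -2.0, 3.0)

def move(p):
    # Points are rotated and translated; normals are only rotated.
    return tuple(r + ti for r, ti in zip(rotate(p, axis, theta), t))

before = ppf(p1, n1, p2, n2)
after = ppf(move(p1), rotate(n1, axis, theta), move(p2), rotate(n2, axis, theta))

print("PPF before:", before)
print("PPF after: ", after)
```

Because every component is a distance or an angle between vectors that rotate together, the two tuples agree up to floating-point error, whatever rotation is chosen. Raw coordinates offer no such guarantee, which is why a network fed with them must rely on augmentation to see each pose.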

Paper Structure (35 sections, 23 equations, 12 figures, 9 tables)

Figures (12)

  • Figure 1: Feature Matching Recall (FMR) on 3DLoMatch [huang2021predator] and Rotated 3DLoMatch. Distance to the diagonal represents the robustness against rotations. Among all the state-of-the-art approaches, RoITr not only ranks first on both benchmarks but also shows the best robustness against the enlarged rotations.
  • Figure 2: An Overview of RoITr. From left to right: (0) RoITr takes as input a pair of triplets $\mathcal{P} = (\mathbf{P}, \mathbf{N}, \mathbf{X})$ and $\mathcal{Q} = (\mathbf{Q}, \mathbf{M}, \mathbf{Y})$, each with three dimensions referring to the point cloud, the estimated normals, and the initial features. (1) [§\ref{sec:local_geometry}] A stack of encoder blocks hierarchically downsamples the points to coarser superpoints and encodes the local geometry, yielding superpoint triplets $\mathcal{P}^\prime$ and $\mathcal{Q}^\prime$. Each encoder block consists of an Attentional Abstraction Layer (AAL) for downsampling and abstraction, followed by $e\times$ PPF Attention Layers (PALs) for local geometry encoding and context aggregation. Both of them are based on our proposed PPF Attention Mechanism (PAM), which enables the pose-agnostic encoding of pure geometry (see Fig. \ref{fig:differences} and Fig. \ref{fig:local_attention}). (2) [§\ref{sec:global_context}] Global information is fused to enhance the superpoint features of $\mathcal{P}^\prime$ and $\mathcal{Q}^\prime$. The geometric cues are globally aggregated as a rotation-invariant position representation, which introduces spatial awareness in the consecutive cross-frame context aggregation. After a stack of $g\times$ global transformers, the globally-enhanced triplets $\widetilde{\mathcal{P}}^\prime$ and $\widetilde{\mathcal{Q}}^\prime$ are produced. (3) [§\ref{sec:local_geometry}] Superpoint triplets $\mathcal{P}^\prime$ and $\mathcal{Q}^\prime$ are decoded to point triplets $\hat{\mathcal{P}}$ and $\hat{\mathcal{Q}}$ by a stack of decoder blocks. Each block consists of a Transition Up Layer (TUL) for upsampling and context aggregation, followed by $d\times$ PALs. (4) [§\ref{sec:matching}] By adopting the coarse-to-fine matching [yu2021cofinet], $\widetilde{\mathcal{P}}^\prime$ and $\widetilde{\mathcal{Q}}^\prime$ are matched to generate superpoint correspondences, which are consecutively refined to point correspondences between $\hat{\mathcal{P}}$ and $\hat{\mathcal{Q}}$. (5) $\hat{\mathcal{C}}$ is established between $\hat{\mathcal{P}}$ and $\hat{\mathcal{Q}}$.
  • Figure 3: Illustration of the different self-attention computations in standard attention [vaswani2017attention], GeoTrans [qin2022geometric], and PAM.
  • Figure 4: Left: The workflow of the PPF Attention Mechanism (PAM). Right: Detailed calculation of the attention.
  • Figure 5: The computation graph of our global transformer consisting of the Geometry-Aware Self-Attention Module (GSM) and Position-Aware Cross-Attention Module (PCM).
  • ...and 7 more figures