
RI-MAE: Rotation-Invariant Masked AutoEncoders for Self-Supervised Point Cloud Representation Learning

Kunming Su, Qiuxia Wu, Panpan Cai, Xiaogang Zhu, Xuequan Lu, Zhiyong Wang, Kun Hu

TL;DR

RI-MAE addresses rotation sensitivity in self-supervised masked point-cloud learning by introducing a rotation-invariant Transformer backbone (RI-Transformer) that disentangles geometry content from orientation and uses rotation-invariant orientation embeddings (RI-OE) and position embeddings (RI-PE). It further uses a dual-branch teacher-student framework to perform masked patch reconstruction entirely within a learned rotation-invariant latent space, so that supervision stays consistent across arbitrary orientations. The approach yields state-of-the-art results on real-world classification, few-shot learning, and segmentation benchmarks under rotated conditions. The model is pretrained on ShapeNet and shows consistent gains across multiple downstream tasks, with ablations confirming the contributions of RI-OE, RI-PE, and the dual-branch design.

Abstract

Masked point modeling methods have recently achieved great success in self-supervised learning for point cloud data. However, these methods are sensitive to rotations and often exhibit sharp performance drops when encountering rotational variations. In this paper, we propose Rotation-Invariant Masked AutoEncoders (RI-MAE) to address two major challenges: 1) achieving rotation-invariant latent representations, and 2) facilitating self-supervised reconstruction in a rotation-invariant manner. For the first challenge, we introduce RI-Transformer, which combines disentangled geometry content with rotation-invariant relative orientation and position embedding mechanisms to construct a rotation-invariant point cloud latent space. For the second challenge, we devise a dual-branch student-teacher architecture that enables self-supervised learning via the reconstruction of masked patches within the learned rotation-invariant latent space. Each branch is based on an RI-Transformer, and the two are connected by an additional RI-Transformer predictor. The teacher encodes all point patches, while the student encodes only the unmasked ones. Finally, the predictor predicts the latent features of the masked patches from the student's output latent embeddings, supervised by the teacher's outputs. Extensive experiments demonstrate that our method is robust to rotations and achieves state-of-the-art performance on various downstream tasks. Our code is available at https://github.com/kunmingsu07/RI-MAE.
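
The data flow of the dual-branch scheme described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the encoders and predictor are stand-in single-layer maps rather than RI-Transformers, and the mask ratio, the mean-squared-error loss, the teacher weights starting as a copy of the student's, and all names (`toy_encoder`, `toy_predictor`, `mask_ratio`) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim, mask_ratio = 8, 4, 0.5  # assumed toy sizes, not from the paper

# Stand-in for rotation-invariant patch tokens (one row per point patch).
tokens = rng.normal(size=(num_patches, dim))

# Randomly split patch indices into masked and visible sets.
num_masked = int(num_patches * mask_ratio)
perm = rng.permutation(num_patches)
masked_idx = np.sort(perm[:num_masked])
visible_idx = np.sort(perm[num_masked:])

def toy_encoder(x, weight):
    """Placeholder encoder: one linear map plus tanh (not an RI-Transformer)."""
    return np.tanh(x @ weight)

def toy_predictor(visible_latents, num_targets, weight):
    """Placeholder predictor: maps the mean visible latent to each masked slot."""
    context = visible_latents.mean(axis=0)
    return np.tile(context @ weight, (num_targets, 1))

w_student = rng.normal(size=(dim, dim))
w_teacher = w_student.copy()   # assumption: teacher starts as a copy of the student
w_predictor = rng.normal(size=(dim, dim))

# Teacher encodes all patches; student encodes only the unmasked ones.
teacher_latents = toy_encoder(tokens, w_teacher)
student_latents = toy_encoder(tokens[visible_idx], w_student)

# Predictor estimates latents of the masked patches from the student's output,
# and the teacher's latents at the masked positions act as targets.
predicted = toy_predictor(student_latents, num_masked, w_predictor)
targets = teacher_latents[masked_idx]
loss = float(np.mean((predicted - targets) ** 2))  # assumed MSE objective
```

The point of the sketch is only the routing: the loss compares latent features, not raw coordinates, so the supervision signal never depends on the pose of the input as long as the tokens themselves are rotation-invariant.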

Paper Structure

This paper contains 30 sections, 15 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Illustration of the consistent relative orientation between point patches. A point cloud can be partitioned into several point patches, each of which can be regarded as an aligned patch rotated from a canonical pose with a specific rotation. While the overall pose of the point cloud can undergo various rotations, the relative rotation $\Delta R$ between any two patches remains constant and thus is rotation-invariant.
  • Figure 2: Overview of the proposed RI-MAE architecture. The input point cloud is first divided into point patches via FPS and KNN, and PCA is utilized to align the patches and obtain rotation matrices relative to canonical poses. Then geometry content tokens, RI-OEs, and RI-PEs are formulated as RI-Transformer's inputs. Finally, task heads are employed for downstream tasks. A dual-branch student-teacher scheme is devised to conduct the self-supervised learning to pretrain the RI-Transformer.
  • Figure 3: Visualization of part segmentation results on ShapeNet in the z/SO3 scenario. The leftmost column shows the ground truth; the remaining columns show RI-MAE test results under arbitrary rotations.
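
The invariance stated in the Figure 1 caption can be checked numerically. The sketch below assumes each patch orientation is a rotation matrix $R_i$ that maps the canonical pose to the observed patch, so a global rotation $G$ turns $R_i$ into $G R_i$, and it assumes the relative rotation is defined as $\Delta R = R_i^\top R_j$; the paper may use a different but equivalent convention. The helper names `random_rotation` and `relative_rotation` are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random proper rotation matrix (det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs for a well-defined draw
    if np.linalg.det(q) < 0:         # flip one axis if we got a reflection
        q[:, 0] = -q[:, 0]
    return q

def relative_rotation(r_i, r_j):
    """Assumed convention: Delta R = R_i^T R_j."""
    return r_i.T @ r_j

rng = np.random.default_rng(0)
r_a, r_b = random_rotation(rng), random_rotation(rng)  # two patch orientations
g = random_rotation(rng)                                # global rotation of the cloud

delta_before = relative_rotation(r_a, r_b)
delta_after = relative_rotation(g @ r_a, g @ r_b)

# (G R_a)^T (G R_b) = R_a^T G^T G R_b = R_a^T R_b, since G^T G = I.
invariant = np.allclose(delta_before, delta_after)
# The absolute orientation of a single patch does change under G.
absolute_changed = not np.allclose(r_a, g @ r_a)
```

Under these assumptions `invariant` holds for any choice of `g`, while each patch's absolute orientation changes, which is why a relative quantity is a suitable input for a rotation-invariant embedding.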