Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

Renrui Zhang, Ziyu Guo, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, Hongsheng Li, Peng Gao

TL;DR

Point-M2AE introduces a hierarchical, multi-scale masked autoencoder for self-supervised pre-training on 3D point clouds. By designing a pyramid encoder-decoder with skip connections, back-projecting cross-scale visible regions, and a local-spatial attention scheme during fine-tuning, it learns robust multi-scale geometric representations. The approach achieves state-of-the-art results on ModelNet40 linear evaluation and delivers substantial gains across downstream tasks including ScanObjectNN classification, ShapeNetPart segmentation, few-shot learning, and 3D object detection on ScanNetV2, with ablations validating the importance of hierarchy and masking strategy. This work provides a scalable, effective framework for 3D representation learning from unlabeled data, enabling strong transfer performance across diverse 3D vision tasks.

Abstract

Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers. However, how to exploit masked autoencoding for learning 3D representations of irregular point clouds remains an open question. In this paper, we propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds. Unlike the standard transformer in MAE, we modify the encoder and decoder into pyramid architectures to progressively model spatial geometries and capture both fine-grained and high-level semantics of 3D shapes. For the encoder, which downsamples point tokens by stages, we design a multi-scale masking strategy to generate consistent visible regions across scales, and adopt a local spatial self-attention mechanism during fine-tuning to focus on neighboring patterns. Through multi-scale token propagation, the lightweight decoder gradually upsamples point tokens with complementary skip connections from the encoder, which further promotes reconstruction from a global-to-local perspective. Extensive experiments demonstrate the state-of-the-art performance of Point-M2AE for 3D representation learning. With a frozen encoder after pre-training, Point-M2AE achieves 92.9% accuracy with a linear SVM on ModelNet40, even surpassing some fully trained methods. When fine-tuned on downstream tasks, Point-M2AE achieves 86.43% accuracy on ScanObjectNN, +3.36% over the second-best method, and the hierarchical pre-training scheme substantially benefits few-shot classification, part segmentation and 3D object detection. Code is available at https://github.com/ZrrSkywalker/Point-M2AE.
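To make the idea of local spatial self-attention concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "local" means each token may attend only to tokens whose centers lie within a fixed radius; the function name `local_attention`, the `radius` parameter, and the absence of learned query/key/value projections are all simplifications introduced here for illustration.

```python
import math


def local_attention(tokens, centers, radius):
    """Single-head self-attention restricted to spatial neighbours.

    tokens:  list of feature vectors, one per point token.
    centers: list of 3D coordinates, one per token.
    radius:  hypothetical neighbourhood size (an assumption of this sketch).
    Queries, keys and values are the raw features (no learned projections).
    """
    d = len(tokens[0])
    out = []
    for i, q in enumerate(tokens):
        # Local spatial mask: keep only tokens near token i (always includes i).
        allowed = [j for j, c in enumerate(centers)
                   if math.dist(centers[i], c) <= radius]
        # Scaled dot-product scores over the allowed tokens only.
        scores = [sum(a * b for a, b in zip(q, tokens[j])) / math.sqrt(d)
                  for j in allowed]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * tokens[j][t] for w, j in zip(weights, allowed)) / total
                    for t in range(d)])
    return out
```

With a small radius, a token's output mixes only features from its own neighbourhood; with a radius larger than the whole shape, the function reduces to ordinary global self-attention.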

Paper Structure (53 sections, 1 equation, 18 figures, 5 tables)

Figures (18)

  • Figure 1: Comparison of MAE (Top) and our Point-M2AE (Bottom). MAE for 2D image pre-training adopts a standard transformer with a plain encoder and decoder, while Point-M2AE introduces a hierarchical transformer with skip connections for multi-scale point cloud pre-training.
  • Figure 2: Overall pipeline of Point-M2AE. After the multi-scale masking, we embed point tokens at the 1st scale and feed the visible ones into a hierarchical encoder-decoder transformer, which captures both high-level semantics and fine-grained patterns of the point cloud during pre-training.
  • Figure 3: Multi-scale masking strategy. To obtain consistent visible regions across scales, we first represent the input point cloud by multi-scale coordinates and generate the random mask at the highest scale. Then, we back-project the unmasked visible positions to all earlier scales.
  • Figure 4: Linear evaluation on ModelNet40 by SVM. We report different self-supervised learning methods and underline the second-best one.
  • Figure 5: Shape classification on ModelNet40. '#points' and 'Acc.' denote the number of points for training and the overall accuracy. [S] represents fine-tuning after self-supervised pre-training.
  • ...and 13 more figures
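
The multi-scale masking strategy described in the Figure 3 caption (random mask at the highest scale, then back-projection of visible positions to earlier scales) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released code: it assumes the multi-scale coordinates are built by farthest point sampling and that each coarser token groups its k nearest finer tokens, and the scale sizes, `k`, and `mask_ratio` defaults are hypothetical values chosen for the example.

```python
import math
import random


def farthest_point_sample(points, m, rng):
    """Greedily pick m indices so that the chosen points are spread out."""
    first = rng.randrange(len(points))
    chosen = [first]
    dist = [math.dist(p, points[first]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen


def knn(query, points, k):
    """Indices of the k points nearest to query."""
    return sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))[:k]


def multi_scale_mask(points, sizes=(64, 32, 16), k=4, mask_ratio=0.75, seed=0):
    """Return (centers, children, visible) for a three-scale pyramid.

    centers[s]     : token coordinates at scale s (0 = finest).
    children[s][j] : scale-s token indices grouped under token j of scale s+1.
    visible[s]     : set of visible (unmasked) token indices at scale s.
    All default values are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    centers, children = [], []
    current = points
    for m in sizes:
        idx = farthest_point_sample(current, m, rng)
        coarser = [current[i] for i in idx]
        if centers:  # link each new token to its k nearest finer tokens
            children.append([knn(c, current, k) for c in coarser])
        centers.append(coarser)
        current = coarser

    # Step 1: draw the random mask at the highest (coarsest) scale only.
    top = sizes[-1]
    n_visible = max(1, round(top * (1 - mask_ratio)))
    visible = [set(rng.sample(range(top), n_visible))]

    # Step 2: back-project visible positions to every earlier scale.
    for groups in reversed(children):
        finer = set()
        for j in visible[0]:
            finer.update(groups[j])
        visible.insert(0, finer)
    return centers, children, visible
```

Because a finer token is marked visible only when it belongs to the group of a visible coarser token, the visible regions stay spatially consistent from one scale to the next, which is the property the caption emphasizes.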