3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining

Siming Yan, Yuqi Yang, Yuxiao Guo, Hao Pan, Peng-shuai Wang, Xin Tong, Yang Liu, Qixing Huang

TL;DR

This work tackles 3D self-supervised pretraining by challenging the standard MAE objective of reconstructing masked point positions. It proposes MaskFeat3D, which predicts intrinsic features—namely point normals and surface variation—at masked points using an attention-based decoder that is independent of the encoder. The approach yields consistent gains across diverse encoders and downstream tasks (classification, segmentation, detection), with ablations showing the benefit of jointly modeling normals and variation and the importance of self-attention in the decoder. The results indicate that feature-focused reconstruction, rather than position reconstruction, leads to more robust, scalable 3D pretraining, benefiting both synthetic datasets and real-world scenes.

Abstract

Masked autoencoders (MAE) have recently been introduced to 3D self-supervised pretraining for point clouds due to their great success in NLP and computer vision. Unlike MAEs used in the image domain, where the pretext task is to restore features at the masked pixels, such as colors, existing 3D MAE works reconstruct only the missing geometry, i.e., the locations of the masked points. In contrast to previous studies, we argue that recovering point locations is inessential and that restoring intrinsic point features is far more effective. To this end, we propose to ignore point position reconstruction and instead recover high-order features at masked points, including surface normals and surface variations, through a novel attention-based decoder that is independent of the encoder design. We validate the effectiveness of our pretext task and decoder design using different encoder structures for 3D training and demonstrate the advantages of our pretrained networks on various point cloud analysis tasks.

Paper Structure

This paper contains 16 sections, 3 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Comparison of standard Point-MAE and our proposed method. Unlike standard Point-MAE, which uses masked points as the prediction target, our method uses a novel attention-based decoder to leverage masked points as an additional input and infer the corresponding features.
  • Figure 2: The pretraining pipeline of our masked 3D feature prediction approach. Given a complete input point cloud, we first separate it into masked points and unmasked points (a cube mask is used here for better visualization). The encoder takes the unmasked points as input and outputs the block feature pairs. The decoder then takes the block feature pairs and the query points (i.e., the masked points) as input and predicts the per-query-point features.
  • Figure 3: Visualization of point features. The point normal is color-coded by the normal vector. The surface variation is color-coded so that white indicates a low value and red indicates a high value.
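
The two target features shown in Figure 3 can be illustrated with a small sketch. This section does not give the paper's exact definitions, so the code below assumes the common local-PCA formulation: the normal is the eigenvector of the neighborhood covariance with the smallest eigenvalue, and the surface variation is that smallest eigenvalue divided by the sum of all three. The function name `normals_and_surface_variation`, the neighborhood size `k`, and the brute-force neighbor search are illustrative choices, not details taken from the paper.

```python
import numpy as np


def normals_and_surface_variation(points, k=16):
    """Estimate per-point normals and surface variation with local PCA.

    Assumed (hypothetical) definitions, not confirmed by the source text:
      normal    = eigenvector of the k-NN covariance with the smallest eigenvalue
      variation = lambda_min / (lambda_0 + lambda_1 + lambda_2), in [0, 1/3]
    """
    points = np.asarray(points, dtype=np.float64)
    n = points.shape[0]
    k = min(k, n)

    # Brute-force pairwise squared distances; adequate for small clouds only.
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)

    # Indices of the k nearest neighbors (each point counts as its own neighbor).
    knn_idx = np.argsort(dist2, axis=1)[:, :k]
    neighbors = points[knn_idx]  # shape (n, k, 3)

    # Per-point 3x3 covariance of the centered neighborhood.
    centered = neighbors - neighbors.mean(axis=1, keepdims=True)
    cov = np.einsum("nki,nkj->nij", centered, centered) / k

    # eigh returns eigenvalues in ascending order, eigenvectors as columns.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values

    normals = eigvecs[:, :, 0]  # unit vectors; the sign is ambiguous
    variation = eigvals[:, 0] / np.maximum(eigvals.sum(axis=1), 1e-12)
    return normals, variation
```

Under these assumptions, points on a flat patch get a surface variation near zero, while points near edges or corners get larger values, which matches the white-to-red color coding described for Figure 3. Normals from PCA are unoriented, so a real pipeline would need a separate step to fix their signs.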