PCP-MAE: Learning to Predict Centers for Point Masked Autoencoders

Xiangdong Zhang, Shaofeng Zhang, Junchi Yan

TL;DR

This paper reports a motivating empirical result: when the centers of masked patches are fed directly to the decoder, with no information from the encoder, the decoder still reconstructs well. The reconstruction objective therefore does not force the encoder to learn semantic representations.

Abstract

Masked autoencoders have been widely explored in point cloud self-supervised learning, where the point cloud is generally divided into visible and masked parts. These methods typically include an encoder that accepts the visible patches (normalized) and their corresponding patch centers (positions) as input, and a decoder that accepts the encoder output and the centers (positions) of the masked parts to reconstruct each point in the masked patches. The pre-trained encoders are then used for downstream tasks. In this paper, we show a motivating empirical result: when the centers of masked patches are fed directly to the decoder without information from the encoder, it still reconstructs well. In other words, the patch centers are important, and the reconstruction objective does not necessarily rely on the encoder's representations, which prevents the encoder from learning semantic representations. Based on this key observation, we propose a simple yet effective method, learning to Predict Centers for Point Masked AutoEncoders (PCP-MAE), which guides the model to learn to predict the significant centers and uses the predicted centers in place of the directly provided ones. Specifically, we propose a Predicting Center Module (PCM) that shares parameters with the original encoder and uses extra cross-attention to predict centers. Our method has high pre-training efficiency compared with other alternatives and achieves substantial improvement over Point-MAE, surpassing it by 5.50% on OBJ-BG, 6.03% on OBJ-ONLY, and 5.17% on PB-T50-RS for 3D object classification on the ScanObjectNN dataset. The code is available at https://github.com/aHapBean/PCP-MAE.
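
The patch division and masking step described in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the authors' code. Farthest point sampling for the centers and k-nearest-neighbor grouping for the patches are assumptions (a common choice in Point-MAE-style pipelines, not stated in this text), and the function names, the patch count of 64, and the patch size of 32 are hypothetical. The default masking ratio of 0.6 is the value that Figure 3 reports as performing best.

```python
import numpy as np


def farthest_point_sample(points, num_centers, seed=0):
    """Greedy farthest point sampling; returns indices of the chosen centers."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = np.empty(num_centers, dtype=np.int64)
    chosen[0] = rng.integers(n)
    # Squared distance from every point to its nearest already-chosen center.
    min_dist = np.full(n, np.inf)
    for i in range(1, num_centers):
        diff = points - points[chosen[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(min_dist))
    return chosen


def divide_and_mask(points, num_patches=64, patch_size=32, mask_ratio=0.6, seed=0):
    """Split a point cloud (N, 3) into visible and masked patches with centers.

    Assumption: centers come from farthest point sampling and each patch is
    the patch_size nearest neighbors of its center.
    """
    rng = np.random.default_rng(seed)
    centers = points[farthest_point_sample(points, num_patches, seed)]
    # Squared distances between every center and every point: (G, N).
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn_idx = np.argsort(d2, axis=1)[:, :patch_size]
    # "Normalized" patches: coordinates expressed relative to the patch center.
    patches = points[knn_idx] - centers[:, None, :]
    # Random split of patch indices into masked and visible sets.
    num_masked = int(round(mask_ratio * num_patches))
    perm = rng.permutation(num_patches)
    masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]
    return {
        "visible_patches": patches[visible_idx],
        "visible_centers": centers[visible_idx],
        "masked_patches": patches[masked_idx],
        "masked_centers": centers[masked_idx],
    }


if __name__ == "__main__":
    cloud = np.random.default_rng(1).normal(size=(1024, 3))
    parts = divide_and_mask(cloud)
    for name, value in parts.items():
        print(name, value.shape)
```

In a standard point masked autoencoder, `masked_centers` would be handed directly to the decoder. The abstract's observation is that this input alone is enough for good reconstruction, which is why PCP-MAE predicts the centers instead of providing them.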

Paper Structure (17 sections, 12 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: Illustrations of MAE reconstruction results for 2-D MAE and Point-MAE when the masking ratio equals 100%.
  • Figure 2: Overview of the proposed PCP-MAE. After patch division, the centers and normalized patches are divided into visible and masked parts, with center coordinates embedded into positional embeddings (PE) and patches embedded into tokens (embeddings). The encoder accepts visible tokens and PE as input, performing self-attention. Simultaneously, the weight-shared PCM (Predicting Center Module) performs cross-attention (masked tokens as query, and visible together with masked tokens as key and value) to acquire the knowledge needed to predict the positional embeddings of the masked patches. CD-$\mathcal{L}_2$ refers to the $\ell_2$ Chamfer Distance loss function (Fan et al., 2017).
  • Figure 3: Performance of different masking ratios in our PCP-MAE. Accuracy (%) on the PB-T50-RS variant of ScanObjectNN is reported. A masking ratio of $0.6$ performs best.
  • Figure 4: Performance of different $\eta$ in the objective function $\mathcal{L} = \eta \mathcal{L}_{PC} + \mathcal{L}_{Recon}$. Accuracy (%) on the PB-T50-RS variant of ScanObjectNN is reported. $\eta=0.1$ performs best.
  • Figure 5: Additional visualizations of Point-MAE reconstruction results on the ScanObjectNN dataset.
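
The loss terms named in the captions for Figures 2 and 4 can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation. `chamfer_l2` uses the usual symmetric squared-distance form of the Chamfer Distance. The exact form of $\mathcal{L}_{PC}$ is not given in this text, so the mean squared error between predicted and target positional embeddings used here is an assumption, and the function names are hypothetical. The default $\eta = 0.1$ is the value Figure 4 reports as performing best.

```python
import numpy as np


def chamfer_l2(pred, target):
    """Symmetric squared-L2 Chamfer Distance between point sets (N, 3) and (M, 3)."""
    # Pairwise squared distances: (N, M).
    d2 = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    # Nearest-neighbor term in each direction, averaged over the points.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()


def total_loss(pred_patches, true_patches, pred_pe, true_pe, eta=0.1):
    """L = eta * L_PC + L_Recon over the masked patches.

    Assumption: L_PC is the mean squared error between predicted and target
    positional embeddings; the source text does not specify its exact form.
    """
    recon = np.mean([chamfer_l2(p, t) for p, t in zip(pred_patches, true_patches)])
    pc = np.mean((pred_pe - true_pe) ** 2)
    return eta * pc + recon


if __name__ == "__main__":
    a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    b = a + np.array([0.0, 0.0, 1.0])  # same patch shifted by one unit along z
    print(chamfer_l2(a, a), chamfer_l2(a, b))
    print(total_loss([a], [a], np.zeros((1, 4)), np.ones((1, 4))))
```

With a small $\eta$, the center-prediction term acts as an auxiliary signal while reconstruction remains the dominant part of the objective.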