ExpPoint-MAE: Better interpretability and performance for self-supervised point cloud transformers
Ioannis Romanelis, Vlassis Fotis, Konstantinos Moustakas, Adrian Munteanu
TL;DR
Self-supervised pretraining for point cloud transformers must cope with data scarcity and domain differences. This paper evaluates Masked Autoencoding (MAE) as a pretraining scheme and explores Momentum Contrast (MoCo) as an alternative. It introduces a strategic unfreezing schedule for finetuning that preserves pretrained backbone knowledge while boosting downstream accuracy, achieving state-of-the-art classification results among transformer models on ModelNet40 and ScanObjectNN. The authors also adapt explainability tools (CKA, attention visualization, receptive fields) to 3D data. They show that MAE pretraining fosters local, semantically meaningful attention, and that larger pretraining datasets lead to a local inductive bias reminiscent of convolutions. Ablations and the contrastive-learning experiments further indicate that pretraining shapes the learned representations in meaningful ways.
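For readers unfamiliar with CKA (centered kernel alignment), one of the explainability tools named above, the following is a minimal sketch of its commonly used linear form. It is not taken from the paper: the function name `linear_cka`, the feature shapes, and the random test features are assumptions made purely for illustration, and the authors' adaptation to 3D data may differ.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Hypothetical helper for illustration; not the paper's implementation.
    """
    # Center each feature dimension across the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Illustrative stand-ins for per-sample features taken from two layers.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(128, 32))
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
feats_b = feats_a @ rotation            # same features, rotated
feats_c = rng.normal(size=(128, 32))    # unrelated features

same = linear_cka(feats_a, feats_a)
rotated = linear_cka(feats_a, feats_b)
unrelated = linear_cka(feats_a, feats_c)
```

Linear CKA is invariant to orthogonal transformations of the features, so the rotated copy scores essentially the same as the original, while unrelated random features score much lower. This property is what makes it convenient for comparing layers before and after pretraining or finetuning.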
Abstract
In this paper we delve into the properties of transformers, attained through self-supervision, in the point cloud domain. Specifically, we evaluate the effectiveness of Masked Autoencoding as a pretraining scheme, and explore Momentum Contrast as an alternative. In our study we investigate the impact of data quantity on the learned features, and uncover similarities in the transformer's behavior across domains. Through comprehensive visualizations, we observe that the transformer learns to attend to semantically meaningful regions, indicating that pretraining leads to a better understanding of the underlying geometry. Moreover, we examine the finetuning process and its effect on the learned representations. Based on that, we devise an unfreezing strategy that consistently outperforms our baseline without introducing any other modifications to the model or the training pipeline, and we achieve state-of-the-art results in the classification task among transformer models.
