ExpPoint-MAE: Better interpretability and performance for self-supervised point cloud transformers

Ioannis Romanelis, Vlassis Fotis, Konstantinos Moustakas, Adrian Munteanu

TL;DR

Self-supervised pretraining for point cloud transformers faces data scarcity and domain differences; this paper compares Masked Autoencoding (MAE) and Momentum Contrast (MoCo) pretraining. It introduces a strategic unfreezing finetuning schedule that preserves pretrained backbone knowledge while boosting downstream accuracy, achieving state-of-the-art among transformer models on ModelNet40 and ScanObjectNN. The authors also adapt explainability tools (CKA, attention visualization, receptive fields) to 3D data and show that MAE fosters local, semantically meaningful attention with increasing data. They find that larger pretraining data leads to local inductive bias reminiscent of convolutions, and demonstrate through ablations and contrastive learning that pretraining shapes representations in meaningful ways.

Abstract

In this paper we delve into the properties of transformers, attained through self-supervision, in the point cloud domain. Specifically, we evaluate the effectiveness of Masked Autoencoding as a pretraining scheme, and explore Momentum Contrast as an alternative. In our study we investigate the impact of data quantity on the learned features, and uncover similarities in the transformer's behavior across domains. Through comprehensive visualizations, we observe that the transformer learns to attend to semantically meaningful regions, indicating that pretraining leads to a better understanding of the underlying geometry. Moreover, we examine the finetuning process and its effect on the learned representations. Based on that, we devise an unfreezing strategy which consistently outperforms our baseline without introducing any other modifications to the model or the training pipeline, and achieve state-of-the-art results in the classification task among transformer models.
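The unfreezing strategy mentioned in the abstract can be illustrated with a minimal sketch: the pretrained backbone stays frozen while the classification head trains, and is released at a chosen epoch. Everything named here is an assumption for illustration (the `Param` stand-in, the `backbone.` name prefix, and the unfreeze epoch of 100); the abstract does not specify the authors' actual schedule or implementation.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical stand-in for a framework parameter with a trainable flag."""
    name: str
    requires_grad: bool = True


def apply_unfreeze_schedule(params, epoch, unfreeze_epoch, backbone_prefix="backbone."):
    """Freeze backbone parameters before `unfreeze_epoch`; train everything afterwards.

    Returns the names of the parameters that are trainable at `epoch`.
    """
    backbone_trainable = epoch >= unfreeze_epoch
    for p in params:
        if p.name.startswith(backbone_prefix):
            p.requires_grad = backbone_trainable
        else:
            # The task head is always trained.
            p.requires_grad = True
    return [p.name for p in params if p.requires_grad]


if __name__ == "__main__":
    params = [Param("backbone.block0.attn"), Param("backbone.block11.mlp"), Param("head.fc")]
    # The unfreeze epoch (100) is an arbitrary illustrative choice.
    for epoch in (0, 99, 100, 299):
        print(epoch, apply_unfreeze_schedule(params, epoch, unfreeze_epoch=100))
```

In a real training loop the same check would run at the start of every epoch, toggling the framework's own trainable flag on the backbone parameters.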

Paper Structure (24 sections, 2 equations, 15 figures, 4 tables)

Figures (15)

  • Figure 1: Graphical description of the two pretraining pipelines studied in this paper, namely Masked Autoencoding (MAE) [mae] and Momentum Contrast (MoCo) [MoCo]. In simple terms, MAE trains an autoencoder to reconstruct a shape with missing parts, whereas MoCo trains two networks (Student/Teacher) to generate approximately equal predictions for different augmentations of a data sample.
  • Figure 2: Comparison of different unfreezing points for the backbone of the transformer model. Unfreezing the model too early or too late can result in suboptimal results, as the network may either 'forget' the features learned through pretraining or fail to acquire task-specific knowledge, respectively.
  • Figure 3: CKA comparison of the pretrained backbone (y-axis) with versions that have been finetuned for 300 epochs, unfreezing the backbone at various epochs (x-axis). The first and second blocks indicate the positional and feature embedding extractors, while the rest of the blocks correspond to the outputs of the attention layers. High values indicate high similarity between feature representations. The later the network is unfrozen, the higher the similarity with the pretrained backbone, retaining the properties learned through pretraining. In the case of (a), as done in [mae], the final network has very little similarity with the pretrained backbone, thereby nullifying the effectiveness of pretraining.
  • Figure 4: Attention Visualization of the classification token for each block (1-12), averaged across heads (brighter = higher score). Although the classification token's attention score towards itself cannot be visualized, it has the highest value in all cases. This score is included in the normalization process, so that the relative scale between the scores remains visible.
  • Figure 5: Attention Visualization of the classification token for each block (1-12), averaged across heads for a finetuned model without pretraining. The locations it attends to evidently do not follow any recognizable pattern. In each layer, the attention is focused on a specific location, which is often the same between layers (e.g. head in layers 0, 5, 7, 10).
  • ...and 10 more figures
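
The CKA comparison described in the Figure 3 caption scores how similar two sets of feature representations are. Below is a minimal sketch of the linear variant of CKA in plain Python. It assumes features are given as lists of rows (one row per sample); the choice of the linear kernel and all function names are assumptions, since the captions do not state which CKA variant the authors use.

```python
import math


def _center(X):
    """Subtract the per-column mean from a list-of-rows matrix."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def _cross_fro_sq(X, Y):
    """Squared Frobenius norm of X^T Y for matrices with the same number of rows."""
    total = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            s = sum(X[k][i] * Y[k][j] for k in range(len(X)))
            total += s * s
    return total


def linear_cka(X, Y):
    """Linear CKA between features X (n x d1) and Y (n x d2) of the same n samples.

    Values near 1 indicate highly similar representations. The result is
    undefined (division by zero) if either feature matrix is constant.
    """
    Xc, Yc = _center(X), _center(Y)
    numerator = _cross_fro_sq(Xc, Yc)
    denominator = math.sqrt(_cross_fro_sq(Xc, Xc) * _cross_fro_sq(Yc, Yc))
    return numerator / denominator


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]]
    # Y is X rotated by 90 degrees; linear CKA is invariant to such rotations.
    Y = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0], [-3.0, 2.0]]
    print(linear_cka(X, Y))
```

A layer-by-layer similarity map like the one in Figure 3 would apply such a function to every pair of block outputs from the pretrained and finetuned backbones, evaluated on the same input samples.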