PMT-MAE: Dual-Branch Self-Supervised Learning with Distillation for Efficient Point Cloud Classification

Qiang Zheng, Chao Zhang, Jian Sun

TL;DR

PMT-MAE targets efficient self-supervised learning for 3D point clouds. It introduces a dual-branch architecture that combines Transformer and MLP modules, paired with a two-stage distillation from the Point-M2AE teacher. The model applies MAE-style masking on a non-pyramidal structure and distills knowledge in both phases: feature distillation during pre-training and logit distillation during fine-tuning. It reaches 93.6% accuracy on ModelNet40 with only 40 epochs for both pre-training and fine-tuning, outperforming the baseline Point-MAE (93.2%) and the teacher Point-M2AE (93.4%) at a favorable computational cost in FLOPs, which makes it practical for resource-constrained scenarios. The results point to two contributing factors: the feature diversity provided by the dual-branch representation and the targeted distillation at each training stage.
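
The two distillation stages named above can be illustrated with a minimal sketch. The source does not state which loss functions are used, so the choices here are assumptions: mean squared error for feature distillation, and a temperature-softened KL divergence for logit distillation. The function names and the `temperature` parameter are hypothetical.

```python
import math


def mse_feature_loss(student_feats, teacher_feats):
    """Assumed pre-training loss: mean squared error between the student
    encoder's token features and the teacher encoder's token features.
    Both inputs are lists of equally sized feature vectors."""
    total, count = 0.0, 0
    for s_tok, t_tok in zip(student_feats, teacher_feats):
        for s, t in zip(s_tok, t_tok):
            total += (s - t) ** 2
            count += 1
    return total / count


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_logit_loss(student_logits, teacher_logits, temperature=1.0):
    """Assumed fine-tuning loss: KL(teacher || student) between the
    softened class distributions of teacher and student."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        pt * (math.log(pt) - math.log(max(ps, 1e-12)))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )


# Toy usage: one token with two feature channels, and three class logits.
feature_loss = mse_feature_loss([[1.0, 2.0]], [[0.0, 0.0]])
logit_loss = kl_logit_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], temperature=2.0)
```

In practice either term would typically be added to the task loss of its stage (reconstruction during pre-training, cross-entropy during fine-tuning) with a weighting factor; the source does not specify that weighting.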

Abstract

Advances in self-supervised learning are essential for enhancing feature extraction and understanding in point cloud processing. This paper introduces PMT-MAE (Point MLP-Transformer Masked Autoencoder), a novel self-supervised learning framework for point cloud classification. PMT-MAE features a dual-branch architecture that integrates Transformer and MLP components to capture rich features. The Transformer branch leverages global self-attention for intricate feature interactions, while the parallel MLP branch processes tokens through shared fully connected layers, offering a complementary feature transformation pathway. A fusion mechanism then combines these features, enhancing the model's capacity to learn comprehensive 3D representations. Guided by the sophisticated teacher model Point-M2AE, PMT-MAE employs a distillation strategy that includes feature distillation during pre-training and logit distillation during fine-tuning, ensuring effective knowledge transfer. On the ModelNet40 classification task, PMT-MAE achieves an accuracy of 93.6% without a voting strategy, surpassing the baseline Point-MAE (93.2%) and the teacher Point-M2AE (93.4%) and underscoring its ability to learn discriminative 3D point cloud representations. The framework is also highly efficient, requiring only 40 epochs for both pre-training and fine-tuning. This combination of effectiveness and efficiency makes PMT-MAE well suited to scenarios with limited computational resources and a promising solution for practical point cloud analysis.
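
The dual-branch design described in the abstract can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the authors' implementation: the attention branch is single-head, the MLP uses ReLU, and the fusion is an element-wise sum with a residual connection. The abstract only says that "a fusion mechanism" combines the branches, so the fusion rule, all function names, and all dimensions below are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(matrix):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        norm = sum(exps)
        out.append([e / norm for e in exps])
    return out


def attention_branch(tokens, w_q, w_k, w_v):
    """Single-head global self-attention: every token attends to all tokens."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    return matmul(softmax_rows(scores), v)


def mlp_branch(tokens, w1, w2):
    """Shared two-layer MLP applied to each token independently."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(tokens, w1)]
    return matmul(hidden, w2)


def dual_branch_block(tokens, params):
    """Run both branches in parallel on the same tokens and fuse the results."""
    attn = attention_branch(tokens, params["w_q"], params["w_k"], params["w_v"])
    mlp = mlp_branch(tokens, params["w1"], params["w2"])
    # ASSUMPTION: fusion = element-wise sum of both branches plus a residual.
    return [
        [x + a + m for x, a, m in zip(x_row, a_row, m_row)]
        for x_row, a_row, m_row in zip(tokens, attn, mlp)
    ]


def random_matrix(rows, cols, rng):
    """Small random weights, for demonstration only."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden_dim, num_tokens = 8, 16, 5  # hypothetical sizes
params = {
    "w_q": random_matrix(dim, dim, rng),
    "w_k": random_matrix(dim, dim, rng),
    "w_v": random_matrix(dim, dim, rng),
    "w1": random_matrix(dim, hidden_dim, rng),
    "w2": random_matrix(hidden_dim, dim, rng),
}
tokens = random_matrix(num_tokens, dim, rng)
out = dual_branch_block(tokens, params)  # same shape as tokens
```

The sketch makes the complementarity visible: the attention branch mixes information across tokens, while the MLP branch transforms each token with the same weights and no cross-token interaction.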

Paper Structure (21 sections, 6 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: MAE framework for visual learning [Point-MAE2022].
  • Figure 2: Masking strategies for (a) Point-MAE [Point-MAE2022] and (b) Point-M2AE [Point-M2AE2022]. Red and shaded blocks represent visible and masked tokens, respectively. Point-MAE employs a non-pyramidal structure with consistent masking across stages, while the pyramidal structure of Point-M2AE requires complex masking operations in reverse stage order. This leads to inconsistent visible token counts, increased computational and memory overhead, and an effective masking rate that diminishes at shallower stages, all of which hinder model scalability.
  • Figure 3: Illustration of the dual-branch block.
  • Figure 4: The two-stage distillation framework used in PMT-MAE, with distinct processes for the pre-training and fine-tuning phases, both guided by Point-M2AE [Point-M2AE2022]. (a) In the pre-training phase, the pre-trained Point-M2AE model is used for feature distillation: the PMT-MAE encoder is guided to replicate the output features of the Point-M2AE encoder, so that the student captures meaningful feature representations. (b) In the fine-tuning phase, which addresses point cloud classification, the fine-tuned Point-M2AE model is used for logit distillation: the classification predictions of the teacher are transferred to the output of PMT-MAE. This two-stage approach enhances the classification accuracy and robustness of PMT-MAE.
  • Figure 5: Histograms of the Pearson correlation coefficient ($r$) between the output features of the two branches, for each of the six blocks in the PMT-MAE-S model. Each block's distribution is shown in a separate panel.
  • ...and 3 more figures
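
The analysis behind Figure 5 can be approximated with a short sketch that computes Pearson's $r$ between the two branches' output features and bins the values into a histogram. The source does not say how the feature vectors are paired, so this assumes one $r$ per token, computed between that token's Transformer-branch and MLP-branch feature vectors. The function names, the bin count, and the toy data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("r is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def per_token_correlations(branch_a, branch_b):
    """ASSUMPTION: one r per token, pairing the two branches' features."""
    return [pearson_r(a, b) for a, b in zip(branch_a, branch_b)]


def histogram(values, num_bins=10, lo=-1.0, hi=1.0):
    """Count values into equal-width bins over [lo, hi]."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, num_bins - 1))] += 1  # clamp the edge values
    return counts


# Toy features for one block: two tokens, four channels per branch.
transformer_out = [[1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 1.0, 0.0]]
mlp_out = [[2.0, 4.0, 6.0, 8.0], [0.0, 1.0, 0.0, 1.0]]
r_values = per_token_correlations(transformer_out, mlp_out)
counts = histogram(r_values, num_bins=4)
```

Under this reading, a histogram concentrated near $r = 0$ would indicate that the two branches produce largely uncorrelated features, while mass near $r = 1$ would indicate redundancy between them.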