Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning

Bin Ren, Guofeng Mei, Danda Pani Paudel, Weijie Wang, Yawei Li, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Nicu Sebe

TL;DR

This work empirically validates that integrating MAE-based point cloud pre-training with the standard contrastive learning paradigm, even with meticulous design, can decrease performance. It then introduces CL into the MAE-based point cloud pre-training paradigm by leveraging the inherent contrastive properties of MAE.

Abstract

Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones. However, in 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant. This raises the question: Can we take the best of both worlds? To answer this question, we first empirically validate that integrating MAE-based point cloud pre-training with the standard contrastive learning paradigm, even with meticulous design, can lead to a decrease in performance. To address this limitation, we reintroduce CL into the MAE-based point cloud pre-training paradigm by leveraging the inherent contrastive properties of MAE. Specifically, rather than relying on extensive data augmentation as commonly used in the image domain, we randomly mask the input tokens twice to generate contrastive input pairs. Subsequently, a weight-sharing encoder and two identically structured decoders are utilized to perform masked token reconstruction. Additionally, we propose that for an input token masked by both masks simultaneously, the reconstructed features should be as similar as possible. This naturally establishes an explicit contrastive constraint within the generative MAE-based pre-training paradigm, resulting in our proposed method, Point-CMAE. Consequently, Point-CMAE effectively enhances the representation quality and transfer performance compared to its MAE counterpart. Experimental evaluations across various downstream applications, including classification, part segmentation, and few-shot learning, demonstrate the efficacy of our framework in surpassing state-of-the-art techniques under standard ViTs and single-modal settings. The source code and trained models are available at: https://github.com/Amazingren/Point-CMAE.
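
The double-masking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`random_mask`, `co_masked_consistency_loss`), the plain-Python data layout, and the choice of cosine distance as the similarity measure are all assumptions made for illustration. The abstract only states that features reconstructed for a token hidden by both masks should be as similar as possible.

```python
import math
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Return the set of token indices hidden by one random mask."""
    num_masked = int(round(num_tokens * mask_ratio))
    return set(rng.sample(range(num_tokens), num_masked))


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def co_masked_consistency_loss(feats_a, feats_b, mask_a, mask_b):
    """Mean (1 - cosine similarity) over tokens hidden by BOTH masks.

    feats_a / feats_b map a token index to the feature vector that
    decoder A / decoder B reconstructed for that token.
    The cosine-distance form is an assumption, not taken from the paper.
    """
    shared = sorted(mask_a & mask_b)
    if not shared:
        return 0.0
    total = sum(1.0 - cosine_similarity(feats_a[i], feats_b[i]) for i in shared)
    return total / len(shared)


if __name__ == "__main__":
    rng = random.Random(0)
    num_tokens, mask_ratio, dim = 64, 0.6, 8  # hypothetical sizes

    # Mask the same token sequence twice to form the contrastive input pair.
    mask_a = random_mask(num_tokens, mask_ratio, rng)
    mask_b = random_mask(num_tokens, mask_ratio, rng)

    # Stand-ins for the two decoders' reconstructed features (random here).
    feats_a = {i: [rng.gauss(0, 1) for _ in range(dim)] for i in mask_a}
    feats_b = {i: [rng.gauss(0, 1) for _ in range(dim)] for i in mask_b}

    print("co-masked tokens:", len(mask_a & mask_b))
    print("consistency loss:", co_masked_consistency_loss(feats_a, feats_b, mask_a, mask_b))
```

If the two masks are drawn independently with the same ratio r, a given token is hidden by both with probability r², so the constraint acts on roughly r² of the tokens. In a real pipeline this term would be added to the usual masked-token reconstruction loss; how the two are weighted is not specified in the abstract.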

Paper Structure (14 sections, 7 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Illustration of: (a) MAE with MoCo-style [he2020momentum] contrastive learning. An extra queue is required to maintain the negative samples during the entire pre-training. (b) MAE with BYOL-style [grill2020bootstrap] contrastive learning. The asymmetric structure design with the fully connected predictor layer is used to exclude the negative samples. (c) The proposed Point-CMAE framework. Two identically structured decoders, updated differently, are employed to introduce explicit contrastive properties within the generative self-supervised pre-training paradigm (i.e., MAE-based).
  • Figure 2: Classification comparison of different contrastive learning pipelines integrated with the baseline method Point-MAE [PointMAE], evaluated on the ScanObjectNN [uy2019revisiting] dataset.
  • Figure 3: The framework of the proposed Point-CMAE. The symbol $\oplus$ denotes concatenation along the token dimension. The point patch embedding is denoted as "FPS&KNN".
  • Figure 4: (a) The classification results on the ScanObjectNN [uy2019revisiting] (PB_T50_RS) dataset across different masking ratios. The average result over all masking ratios is shown with dashed lines. (b) Classification results on ScanObjectNN [uy2019revisiting] (PB_T50_RS) and part segmentation results on ShapeNetPart [yi2016scalable], provided to study how the depth of the decoder affects pre-training.