Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders

Renrui Zhang, Liuhui Wang, Yu Qiao, Peng Gao, Hongsheng Li

TL;DR

This work tackles the data bottleneck in 3D representation learning by transferring knowledge from 2D pre-trained models into 3D via masked autoencoding. It projects 3D point clouds into multi-view depth maps, feeds them to off-the-shelf 2D models, and introduces two image-to-point schemes: 2D-guided masking and 2D-semantic reconstruction. This enables cross-modal transfer without requiring paired 2D-3D data. Empirically, the frozen I2P-MAE reaches 93.4% linear SVM accuracy on ModelNet40, and fine-tuning yields a state-of-the-art 90.11% on the hardest ScanObjectNN split; pre-training is also reported to be data-efficient when 3D data is limited. The results suggest that 2D-to-3D knowledge transfer is a practical route to transferable 3D representations with less reliance on large 3D datasets.

Abstract

Pre-training on large amounts of image data has become the de facto standard for robust 2D representations. In contrast, because data acquisition and annotation are expensive, a paucity of large-scale 3D datasets severely hinders the learning of high-quality 3D features. In this paper, we propose an alternative that obtains superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE. Through self-supervised pre-training, we leverage well-learned 2D knowledge to guide 3D masked autoencoding, which reconstructs the masked point tokens with an encoder-decoder architecture. Specifically, we first use off-the-shelf 2D models to extract multi-view visual features of the input point cloud, and then conduct two types of image-to-point learning schemes on top. First, we introduce a 2D-guided masking strategy that keeps semantically important point tokens visible to the encoder. Compared to random masking, the network can better concentrate on significant 3D structures and recover the masked tokens from key spatial cues. Second, we enforce these visible tokens to reconstruct the corresponding multi-view 2D features after the decoder. This enables the network to effectively inherit high-level 2D semantics learned from rich image data for discriminative 3D modeling. Aided by our image-to-point pre-training, the frozen I2P-MAE, without any fine-tuning, achieves 93.4% accuracy for linear SVM on ModelNet40, competitive with the fully trained results of existing methods. By further fine-tuning on ScanObjectNN's hardest split, I2P-MAE attains the state-of-the-art 90.11% accuracy, 3.68% above the second-best, demonstrating superior transferable capacity. Code will be available at https://github.com/ZrrSkywalker/I2P-MAE.
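
The 2D-guided masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each point token already has a scalar saliency score (for example, derived from the 2D models' outputs), and it assumes visible tokens are drawn without replacement with probability proportional to a softmax over those scores. The function name `guided_visible_mask` and the default `mask_ratio` are hypothetical choices for illustration.

```python
import numpy as np

def guided_visible_mask(saliency, mask_ratio=0.8, rng=None):
    """Pick visible tokens, favouring high-saliency ones (illustrative sketch).

    saliency:   (T,) per-token scores; assumed to come from 2D guidance.
    mask_ratio: fraction of tokens to hide from the encoder (assumed value).
    Returns a boolean (T,) array, True where the token stays visible.
    """
    rng = np.random.default_rng() if rng is None else rng
    saliency = np.asarray(saliency, dtype=np.float64)
    num_tokens = saliency.shape[0]
    num_visible = max(1, int(round(num_tokens * (1.0 - mask_ratio))))

    # Numerically stable softmax turns scores into sampling probabilities.
    probs = np.exp(saliency - saliency.max())
    probs /= probs.sum()

    # Sample without replacement so every visible token is distinct.
    visible_idx = rng.choice(num_tokens, size=num_visible, replace=False, p=probs)
    visible = np.zeros(num_tokens, dtype=bool)
    visible[visible_idx] = True
    return visible
```

With uniform saliency this reduces to random masking, which makes the contrast with the random-masking baseline mentioned in the abstract easy to see; sharper saliency concentrates the visible set on the tokens the 2D guidance considers important.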

Paper Structure (42 sections, 4 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: Image-to-Point Masked Autoencoders. We leverage 2D pre-trained models to guide MAE pre-training in 3D, which alleviates the need for large-scale 3D datasets and learns from 2D knowledge for superior 3D representations.
  • Figure 2: Comparison of (Left) Existing Methods [pang2022masked, zhang2022point] and (Right) our I2P-MAE. On top of the general 3D MAE architecture, I2P-MAE introduces two schemes of image-to-point learning: 2D-guided masking and 2D-semantic reconstruction.
  • Figure 3: Pre-training Epochs vs. Linear SVM Accuracy on ModelNet40 [modelnet40]. With the image-to-point learning schemes, I2P-MAE shows superior transferable capability and converges much faster than Point-MAE [pang2022masked] and Point-M2AE [zhang2022point].
  • Figure 4: The Pipeline of I2P-MAE. Given an input point cloud, we leverage the 2D pre-trained models to generate two guidance signals from the projected depth maps: 2D saliency maps and 2D visual features. We respectively conduct 2D-guided masking and 2D-semantic reconstruction to transfer the encoded 2D knowledge for 3D point cloud pre-training.
  • Figure 5: Image-to-Point Operation (I2P). Indexed by 3D point coordinates, the corresponding multi-view 2D representations are back-projected into 3D space for aggregation.
  • ...and 5 more figures
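
The image-to-point operation described in the Figure 5 caption (2D representations indexed by 3D point coordinates and aggregated in 3D) can be sketched as follows. This is a hedged illustration, not the paper's procedure: it assumes points normalized to [-1, 1], axis-aligned orthographic views, square feature maps, and mean aggregation across views. The helper names `project_to_pixels` and `image_to_point` are hypothetical.

```python
import numpy as np

def project_to_pixels(points, drop_axis, size):
    """Orthographic projection (assumed): drop one axis, map [-1, 1] to pixels."""
    keep = [a for a in range(3) if a != drop_axis]
    uv = (points[:, keep] + 1.0) / 2.0                 # rescale to [0, 1]
    pix = np.clip((uv * size).astype(int), 0, size - 1)
    return pix[:, 0], pix[:, 1]                        # row and column indices

def image_to_point(points, view_features):
    """Gather per-point features from several views and average them.

    points:        (N, 3) coordinates, assumed to lie in [-1, 1].
    view_features: dict mapping the dropped axis (0, 1 or 2) to an
                   (H, H, C) feature or saliency map for that view.
    Returns an (N, C) array of aggregated per-point features.
    """
    gathered = []
    for drop_axis, fmap in view_features.items():
        rows, cols = project_to_pixels(points, drop_axis, fmap.shape[0])
        gathered.append(fmap[rows, cols])              # (N, C) lookup per view
    return np.mean(gathered, axis=0)                   # mean over views (assumed)
```

Because each point indexes directly into every view's map, no paired 2D-3D data is needed: the 2D maps are computed from projections of the same point cloud. Other aggregation rules (for example, a weighted sum) would slot into the last line.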