P3P: Pseudo-3D Pre-training for Scaling 3D Voxel-based Masked Autoencoders
Xuechao Chen, Ying Chen, Jialin Li, Qiang Nie, Hanqiu Deng, Yong Liu, Qixing Huang, Yang Li
TL;DR
The paper tackles the data-scale bottleneck in 3D pre-training by converting millions of 2D images into pseudo-3D point clouds with a large depth estimation model, creating a diverse, scalable pre-training corpus. It introduces P3P, a Masked Autoencoder-based framework with a voxel-based Sparse Weight Indexing tokenizer that supports a variable number of tokens, paired with a 3D reconstruction target that combines color, geometry, and occupancy information. Key contributions include the P3P-Lift dataset with 1.28 million samples, a linear-time 3D tokenizer, and a hybrid loss; together they yield state-of-the-art results on 3D classification, few-shot learning, and segmentation, demonstrating strong cross-task generalization. By leveraging abundant 2D data, the approach eases the scaling of 3D pre-training toward more powerful 3D foundation models for perception tasks. The authors acknowledge resource constraints and propose directions for further scale-up.
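To make the lifting step concrete, the following is a minimal sketch of how an image and a predicted depth map could be back-projected into a colored point cloud. It assumes a pinhole camera model; the function name `lift_to_pseudo_3d` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the paper, whose actual lifting procedure may differ.

```python
import numpy as np

def lift_to_pseudo_3d(rgb, depth, fx, fy, cx, cy):
    """Back-project an RGB image and per-pixel depth into a colored point cloud.

    Assumes a pinhole camera. The intrinsics are hypothetical inputs,
    not values taken from the paper.
    """
    h, w = depth.shape
    # Pixel coordinate grids: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    xyz = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    # Keep only pixels with a positive depth estimate.
    valid = xyz[:, 2] > 0
    return xyz[valid], colors[valid]
```

Because the cloud is built from a single view, only surfaces visible to the camera are recovered, which is consistent with the "pseudo-3D" naming.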
Abstract
3D pre-training is crucial to 3D perception tasks. Nevertheless, because clean and complete 3D data are difficult to collect, 3D pre-training has persistently faced data scaling challenges. In this work, we introduce a novel self-supervised pre-training framework that incorporates millions of images into 3D pre-training corpora by leveraging a large depth estimation model. These new corpora pose new challenges for the representation ability and embedding efficiency of models. Previous pre-training methods rely on farthest point sampling and k-nearest neighbors to embed a fixed number of 3D tokens. However, these approaches are inadequate for embedding millions of samples whose point counts vary widely, from 1,000 to 100,000. In contrast, we propose a tokenizer with linear-time complexity, which enables the efficient embedding of a flexible number of tokens. Accordingly, we propose a new 3D reconstruction target designed to work with our 3D tokenizer. Our method achieves state-of-the-art performance in 3D classification, few-shot learning, and 3D segmentation. Code is available at https://github.com/XuechaoChen/P3P-MAE.
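The contrast with farthest point sampling can be illustrated with a minimal sketch of voxel grouping. This is not the paper's Sparse Weight Indexing tokenizer; it only shows, under the assumption of a uniform grid with a hypothetical `voxel_size` parameter, why grouping points by voxel takes a single pass over the input and naturally yields a variable number of tokens.

```python
from collections import defaultdict

def voxelize(points, voxel_size):
    """Group points into occupied voxels in a single pass.

    Returns a dict mapping integer voxel coordinates to the indices of
    the points inside that voxel. Each occupied voxel could serve as one
    token, so the token count depends on the input.
    """
    voxels = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        # Floor division maps each coordinate to its grid cell,
        # including negative coordinates.
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key].append(idx)
    return dict(voxels)
```

Each point is hashed once, so the expected cost grows linearly with the number of points. Farthest point sampling, by comparison, must update distances to all points for every sampled center, and it fixes the number of tokens in advance.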
