P3P: Pseudo-3D Pre-training for Scaling 3D Voxel-based Masked Autoencoders

Xuechao Chen, Ying Chen, Jialin Li, Qiang Nie, Hanqiu Deng, Yong Liu, Qixing Huang, Yang Li

TL;DR

The paper tackles the data-scale bottleneck in 3D pre-training by lifting millions of 2D images into pseudo-3D point clouds with a large depth estimation model, producing a diverse and scalable pre-training corpus. It introduces P3P, a Masked Autoencoder-based framework with a voxel-based Sparse Weight Indexing tokenizer that handles a variable number of tokens per sample, paired with a 3D reconstruction target that combines color, geometry, and occupancy information. Key contributions are the P3P-Lift dataset of 1.28 million samples, a linear-time 3D tokenizer, and a hybrid loss; together they yield state-of-the-art results on 3D classification, few-shot learning, and 3D segmentation. By drawing on abundant 2D data, the approach makes it easier to scale 3D pre-training toward stronger 3D foundation models for perception tasks; the authors acknowledge resource constraints and propose directions for further scale-up.

Abstract

3D pre-training is crucial to 3D perception tasks. Nevertheless, limited by the difficulties in collecting clean and complete 3D data, 3D pre-training has persistently faced data scaling challenges. In this work, we introduce a novel self-supervised pre-training framework that incorporates millions of images into 3D pre-training corpora by leveraging a large depth estimation model. New pre-training corpora encounter new challenges in representation ability and embedding efficiency of models. Previous pre-training methods rely on farthest point sampling and k-nearest neighbors to embed a fixed number of 3D tokens. However, these approaches prove inadequate when it comes to embedding millions of samples that feature a diverse range of point numbers, spanning from 1,000 to 100,000. In contrast, we propose a tokenizer with linear-time complexity, which enables the efficient embedding of a flexible number of tokens. Accordingly, a new 3D reconstruction target is proposed to cooperate with our 3D tokenizer. Our method achieves state-of-the-art performance in 3D classification, few-shot learning, and 3D segmentation. Code is available at https://github.com/XuechaoChen/P3P-MAE.
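
To make the image-to-3D lifting step concrete, here is a minimal sketch of back-projecting a depth map into a colored point cloud. It assumes a pinhole camera model with known intrinsics; the function name `lift_image` and the parameters `fx`, `fy`, `cx`, `cy` are illustrative and not taken from the paper, whose exact lifting procedure is not described in this summary. The depth values stand in for the output of a depth estimation model.

```python
from typing import List, Sequence, Tuple

# One pseudo-3D point: position (x, y, z) followed by color (r, g, b).
Point = Tuple[float, float, float, int, int, int]


def lift_image(
    depth: Sequence[Sequence[float]],
    rgb: Sequence[Sequence[Tuple[int, int, int]]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project each pixel (u, v) with depth d into camera space.

    Assumes a pinhole camera (hypothetical intrinsics fx, fy, cx, cy).
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, (depth_row, rgb_row) in enumerate(zip(depth, rgb)):
        for u, (d, color) in enumerate(zip(depth_row, rgb_row)):
            if d <= 0:
                continue  # no valid depth estimate for this pixel
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            points.append((x, y, d, *color))
    return points


# Toy 2x2 "image": one pixel has no valid depth.
toy_depth = [[1.0, 2.0], [0.0, 4.0]]
toy_rgb = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (9, 9, 9)]]
toy_cloud = lift_image(toy_depth, toy_rgb, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Because invalid pixels are dropped, the number of points differs from image to image, which is one way a corpus of lifted images ends up with widely varying point counts per sample.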

Paper Structure (26 sections, 10 equations, 2 figures, 4 tables)

This paper contains 26 sections, 10 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: A comparison of the previous 3D tokenizer (top branch) with our 3D tokenizer (bottom branch). The right chart shows the giga floating-point operations (GFLOPs) needed by the previous tokenizer (F-K-P) and by ours (V-P-S). Our 3D tokenizer requires far fewer operations than the previous tokenizer when embedding the same point cloud.
  • Figure 2: Overall pipeline of our 3D pre-training approach.
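
The contrast drawn in Figure 1 can be illustrated with a small sketch of voxel-based grouping. This is not the paper's Sparse Weight Indexing tokenizer; it only shows, under simplified assumptions, why hashing points into voxels takes a single pass over the input and naturally yields a variable number of tokens, whereas farthest point sampling with k-nearest neighbors fixes the token count in advance. The function name `voxel_groups` and the `voxel_size` parameter are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Xyz = Tuple[float, float, float]
VoxelKey = Tuple[int, int, int]


def voxel_groups(points: Iterable[Xyz], voxel_size: float) -> Dict[VoxelKey, List[Xyz]]:
    """Group points by the integer voxel cell that contains them.

    Each point is visited once and hashed into a dictionary, so the
    expected cost grows linearly with the number of points. Every
    occupied voxel becomes one group (one candidate token).
    """
    groups: Dict[VoxelKey, List[Xyz]] = defaultdict(list)
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        groups[key].append((x, y, z))
    return dict(groups)


# Four points falling into three distinct voxels of side length 1.0.
toy_points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4), (1.5, 0.1, 0.1), (-0.1, 0.0, 0.0)]
toy_groups = voxel_groups(toy_points, voxel_size=1.0)
```

The number of groups depends on how many voxels the sample occupies, so a sparse sample and a dense sample produce different token counts without any resampling step.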