P3P: Pseudo-3D Pre-training for Scaling 3D Voxel-based Masked Autoencoders

Xuechao Chen; Ying Chen; Jialin Li; Qiang Nie; Hanqiu Deng; Yong Liu; Qixing Huang; Yang Li

TL;DR

The paper tackles the data-scale bottleneck in 3D pre-training by lifting millions of 2D images into pseudo-3D point clouds with a large depth estimation model, creating a diverse and scalable pre-training corpus. It introduces P3P, a Masked Autoencoder-based framework built on a voxel-based Sparse Weight Indexing tokenizer that embeds a variable number of tokens per sample, paired with a 3D reconstruction target that combines color, geometry, and occupancy. Key contributions are the P3P-Lift dataset of 1.28 million samples, a linear-time 3D tokenizer, and a hybrid loss; together they yield state-of-the-art results on 3D classification, few-shot learning, and 3D segmentation. By drawing on abundant 2D data, the approach makes 3D pre-training easier to scale. The authors acknowledge resource constraints and propose directions for further scale-up.

Abstract

3D pre-training is crucial to 3D perception tasks. Nevertheless, limited by the difficulties in collecting clean and complete 3D data, 3D pre-training has persistently faced data scaling challenges. In this work, we introduce a novel self-supervised pre-training framework that incorporates millions of images into 3D pre-training corpora by leveraging a large depth estimation model. New pre-training corpora encounter new challenges in representation ability and embedding efficiency of models. Previous pre-training methods rely on farthest point sampling and k-nearest neighbors to embed a fixed number of 3D tokens. However, these approaches prove inadequate when it comes to embedding millions of samples that feature a diverse range of point numbers, spanning from 1,000 to 100,000. In contrast, we propose a tokenizer with linear-time complexity, which enables the efficient embedding of a flexible number of tokens. Accordingly, a new 3D reconstruction target is proposed to cooperate with our 3D tokenizer. Our method achieves state-of-the-art performance in 3D classification, few-shot learning, and 3D segmentation. Code is available at https://github.com/XuechaoChen/P3P-MAE.
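The lifting step described in the abstract can be illustrated with a standard pinhole back-projection. This is a minimal sketch, not the authors' pipeline: the function name `lift_to_pseudo_3d` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical, and the depth map is assumed to have already been produced by some depth estimation model.

```python
import numpy as np


def lift_to_pseudo_3d(rgb, depth, fx, fy, cx, cy):
    """Back-project an RGB image and its depth map into a colored point cloud.

    rgb:   (H, W, 3) array of colors
    depth: (H, W) array of per-pixel depth (assumed output of a depth estimator)
    fx, fy, cx, cy: hypothetical pinhole intrinsics, chosen for illustration
    Returns an (N, 6) array of [x, y, z, r, g, b] for pixels with positive depth.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    xyz = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    # Drop pixels without a usable depth value.
    valid = xyz[:, 2] > 0
    return np.concatenate([xyz[valid], colors[valid]], axis=1)
```

Because invalid pixels are dropped, the number of points per sample varies with the image, which is one reason a tokenizer with a fixed token budget fits this kind of data poorly.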

Paper Structure (26 sections, 10 equations, 2 figures, 4 tables)

Introduction
Approach
Preliminaries.
Pre-training data creation.
Embedding.
Masked Autoencoders pre-training.
Reconstruction target.
Experiment
Transfer Learning
Models.
Pre-training data and settings.
3D classification on 3D objects scanned from the real world.
Few-shot classification on 3D objects scanned from the real world.
3D classification on 3D CAD objects.
3D segmentation.
...and 11 more sections

Figures (2)

Figure 1: A comparison of the previous 3D tokenizer (top branch) with our 3D tokenizer (bottom branch). The right chart shows the giga floating-point operations (GFLOPs) needed in the previous tokenizer (F-K-P) and ours (V-P-S). Our 3D tokenizer requires many fewer operations than the previous tokenizer when embedding the same point cloud.
Figure 2: Overall pipeline of our 3D pre-training approach.
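
The voxel-based grouping that Figure 1 contrasts with farthest point sampling and k-nearest neighbors can be sketched as a single hash-map pass over the points. This is an illustrative assumption about how voxel and patch grouping might work, not the paper's Sparse Weight Indexing and not a reproduction of the GFLOPs in the chart; `voxel_patch_tokens`, `voxel_size`, and `patch_size` are hypothetical names.

```python
from collections import defaultdict

import numpy as np


def voxel_patch_tokens(points, voxel_size, patch_size):
    """Group points into voxels, then voxels into non-overlapping cubic patches.

    points:     (N, 3) array of xyz coordinates
    voxel_size: edge length of one voxel (hypothetical parameter)
    patch_size: voxels per patch edge; one occupied patch = one token
    Returns a dict mapping a patch index (px, py, pz) to the set of occupied
    voxel indices inside that patch.
    """
    # Quantize every point to an integer voxel index in one vectorized pass.
    origin = points.min(axis=0)
    voxels = np.floor((points - origin) / voxel_size).astype(np.int64)
    patches = defaultdict(set)
    for vx, vy, vz in voxels.tolist():
        # Integer division assigns each voxel to exactly one patch.
        key = (vx // patch_size, vy // patch_size, vz // patch_size)
        patches[key].add((vx, vy, vz))
    return patches
```

Each point is visited once and inserted with an expected constant-time hash operation, so the cost grows linearly with the number of points, and the number of tokens equals the number of occupied patches rather than a fixed budget.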
