P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting

Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, Jiwen Lu

TL;DR

The paper tackles data scarcity in 3D point cloud analysis by transferring knowledge from large pre-trained 2D image models. It introduces Point-to-Pixel Prompting, which converts point clouds into geometry-preserving, colorized images and keeps the image backbone frozen while training a lightweight coloring module. Experiments show that better pre-trained 2D models yield better 3D performance, with the method reaching 89.3% accuracy on the hardest ScanObjectNN setting and competitive results on ModelNet40 and ShapeNetPart. The approach offers a parameter-efficient way to leverage 2D pre-training in 3D tasks, and the code is open source.

Abstract

Pre-training large models on large-scale datasets has become a central topic in deep learning. Pre-trained models with high representation ability and transferability have achieved great success and dominate many downstream tasks in natural language processing and 2D vision. However, it is non-trivial to extend this pre-training and tuning paradigm to 3D vision, given that training data are limited and relatively inconvenient to collect. In this paper, we provide a new perspective on leveraging pre-trained 2D knowledge in the 3D domain to tackle this problem: tuning pre-trained image models with the novel Point-to-Pixel Prompting for point cloud analysis at a minor parameter cost. Following the principle of prompt engineering, we transform point clouds into colorful images with geometry-preserved projection and geometry-aware coloring to adapt them to pre-trained image models, whose weights are kept frozen during the end-to-end optimization of point cloud analysis tasks. We conduct extensive experiments to demonstrate that, combined with our proposed Point-to-Pixel Prompting, a better pre-trained image model leads to consistently better performance in 3D vision. Benefiting from the rapid development of image pre-training, our method attains 89.3% accuracy on the hardest setting of ScanObjectNN, surpassing conventional point cloud models with far fewer trainable parameters. Our framework also exhibits very competitive performance on ModelNet classification and ShapeNet Part Segmentation. Code is available at https://github.com/wangzy22/P2P.
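
The idea of optimizing end to end while the pre-trained weights stay frozen can be illustrated with a deliberately tiny stand-in. The sketch below is not the paper's architecture: the function name train_prompt_only, the scalar "backbone" weight, and the scalar "coloring" weight are all hypothetical, chosen only to show that the gradient flows through the frozen model while only the prompting parameter is updated.

```python
def train_prompt_only(samples, backbone_weight=2.0, color_weight=0.1,
                      lr=0.01, steps=200):
    """Fit a scalar 'coloring' weight through a frozen scalar 'backbone'.

    Toy assumption: pred = backbone_weight * (color_weight * x).
    """
    for _ in range(steps):
        for x, target in samples:
            colored = color_weight * x        # trainable prompting step
            pred = backbone_weight * colored  # frozen pre-trained model
            # Gradient of (pred - target)^2 with respect to color_weight.
            # It passes through the backbone, but backbone_weight itself
            # is never modified.
            grad = 2.0 * (pred - target) * backbone_weight * x
            color_weight -= lr * grad
    return backbone_weight, color_weight


if __name__ == "__main__":
    # Targets follow target = 3 * x, so the coloring weight should approach 1.5.
    frozen, learned = train_prompt_only([(1.0, 3.0), (2.0, 6.0)])
    print(frozen, learned)
```

In a real setup the same pattern would typically be expressed by excluding the backbone's parameters from the optimizer, so that only the small prompting module counts toward the trainable parameter budget.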

Paper Structure (39 sections, 1 equation, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Images produced by our Point-to-Pixel Prompting. We show the original point clouds (top row) and the projected colorful images produced by P2P from two different projection views, for synthetic objects from ModelNet40 (left five columns) and real-world objects from ScanObjectNN (right three columns).
  • Figure 2: The pipeline of our proposed P2P framework. Taking a point cloud $P$ as the input, we first encode the geometry information for each point. Then we sample a projection view and rearrange the point-wise features into an image-style layout to obtain the pixel-wise features with Geometry-preserved Projection. A learnable Coloring Module then enriches the colorless projection with color information to produce a colorful image $I$. With the help of the transferable visual knowledge in the pre-trained image model, our P2P framework can be easily transferred to several downstream tasks by attaching a task-specific head. We take the classical Vision Transformer (Dosovitskiy et al., 2020) as our pre-trained image model for illustration in this pipeline.
  • Figure 3: Illustration of ablations. $(*)$ shows the pipeline of the overall P2P framework. Part (a) shows ablations that replace P2P prompting with vanilla fine-tuning or visual prompt tuning (VPT) (Jia et al., 2022). Part (b) illustrates ablations on the Point-to-Pixel Prompting design. Part (c) shows different tuning strategies for the pre-trained image model in our P2P framework. The gray letters above each model correspond to the Model column in Table \ref{tab:ablation}.
  • Figure 4: Visualization of feature distributions in t-SNE representations. Best viewed in color.
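
The projection step described in the Figure 2 caption (sample a view, then rearrange point-wise features into an image-style layout) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name project_points, the rotation about the vertical axis, the orthographic dropping of depth, and the summing of features that land in the same pixel are all choices made for this example.

```python
import math


def project_points(points, features, azimuth, size=8):
    """Rasterize per-point features onto a size x size grid for one view.

    points:   list of (x, y, z) tuples
    features: list of per-point feature vectors (equal length)
    azimuth:  view angle in radians, rotation about the vertical (y) axis
    """
    cos_a, sin_a = math.cos(azimuth), math.sin(azimuth)
    # Rotate about the y axis and discard the resulting depth coordinate
    # (assumed orthographic projection).
    planar = [(cos_a * x + sin_a * z, y) for x, y, z in points]

    xs = [u for u, _ in planar]
    ys = [v for _, v in planar]
    min_x, min_y = min(xs), min(ys)
    # One shared scale for both axes keeps the object's aspect ratio.
    extent = max(max(xs) - min_x, max(ys) - min_y) or 1.0

    channels = len(features[0])
    image = [[[0.0] * channels for _ in range(size)] for _ in range(size)]
    for (u, v), feat in zip(planar, features):
        col = min(int((u - min_x) / extent * size), size - 1)
        row = min(int((v - min_y) / extent * size), size - 1)
        # Points that fall into the same pixel are accumulated rather than
        # overwritten, so no point's feature is lost to occlusion.
        for c in range(channels):
            image[row][col][c] += feat[c]
    return image


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 1.0, 5.0)]
    feats = [[1.0], [2.0], [3.0]]
    img = project_points(pts, feats, azimuth=0.0, size=4)
    print(img[0][0], img[3][3])
```

In this sketch, row 0 corresponds to the smallest y value (no vertical flip is applied). The resulting grid of pixel-wise features is what a learnable coloring module would then map to a three-channel image before it is passed to the frozen image model.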