Parameter-efficient Prompt Learning for 3D Point Cloud Understanding

Hongyu Sun, Yongcai Wang, Wang Chen, Haoran Deng, Deying Li

TL;DR

This work tackles the challenge of adapting large multi-modal models to 3D point cloud understanding in a parameter- and data-efficient manner. It introduces PPT, consisting of a learnable PromptLearner to replace hand-crafted prompts and a lightweight PointAdapter, with the 3D encoder frozen to maximize efficiency. The method achieves state-of-the-art or strongly competitive results across 3D recognition, few-shot learning, and part segmentation on diverse datasets, while using orders of magnitude fewer trainable parameters than full fine-tuning. The findings demonstrate that parameter-efficient prompt tuning can effectively transfer rich multi-modal knowledge to 3D tasks, with clear gains in data efficiency and practical deployment potential.

Abstract

This paper presents a parameter-efficient prompt tuning method, named PPT, to adapt a large multi-modal model for 3D point cloud understanding. Existing strategies are expensive in computation and storage, and depend on time-consuming prompt engineering. We address these problems from three aspects. First, a PromptLearner module is devised to replace hand-crafted prompts with learnable contexts, automating the prompt tuning process. Second, we lock the pre-trained backbone instead of adopting the full fine-tuning paradigm, which substantially improves parameter efficiency. Finally, a lightweight PointAdapter module is arranged near target tasks to enhance prompt tuning for 3D point cloud understanding. Comprehensive experiments demonstrate the superior parameter and data efficiency of the proposed method. Meanwhile, we obtain new records on 4 public datasets and multiple 3D tasks, i.e., point cloud recognition, few-shot learning, and part segmentation. The implementation is available at https://github.com/auniquesun/PPT.
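
To make the parameter-efficiency argument concrete, here is a minimal sketch of the accounting that follows from freezing the backbone and training only the prompt contexts and the adapter. Every name and size below (`Module`, `trainable_fraction`, the encoder sizes, the context length of 16, the width of 512, the adapter shape) is a hypothetical choice for illustration and is not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Module:
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(modules: List[Module]) -> Tuple[int, int, float]:
    """Return (trainable params, total params, trainable / total)."""
    total = sum(m.num_params for m in modules)
    trainable = sum(m.num_params for m in modules if m.trainable)
    return trainable, total, trainable / total


# Hypothetical sizes, chosen only for illustration (not from the paper).
modules = [
    Module("text_encoder (frozen)", 60_000_000, False),
    Module("point_encoder (frozen)", 20_000_000, False),
    # 16 learnable context vectors of width 512
    Module("prompt_learner contexts", 16 * 512, True),
    # two small linear layers (512 -> 128 -> 512), biases omitted
    Module("point_adapter", 2 * 512 * 128, True),
]

trainable, total, fraction = trainable_fraction(modules)
print(f"trainable: {trainable:,} of {total:,} ({fraction:.4%})")
```

Under these assumed sizes, the trainable share is well below one percent of the total, which is the kind of gap the abstract refers to when it contrasts prompt tuning with full fine-tuning.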

Paper Structure (20 sections, 12 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Manual Prompts vs. Learnable Contexts. The former requires finding suitable prompts by hand. The latter learns context vectors adaptively. The accuracy scores are obtained by running ULIP [xue23ulip] with PointBERT as the 3D encoder.
  • Figure 2: The overall architecture of the proposed method. The class name embedding $\textbf{c}_j$ can be inserted at any position among the learnable vectors. Here we insert it at the end for illustration.
  • Figure 3: Comparison of few-shot classification of different methods on two datasets.
  • Figure 4: (a) Data efficiency of ULIP compared with PPT. (b) Ablation of the context length on 4 datasets; the dashed line shows the average.
  • Figure 5: Part segmentation visualization for PPT predictions.
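
The prompt layout described in the Figure 2 caption, where the class name embedding can sit at any position among the learnable context vectors, can be sketched as follows. This is an illustrative sketch only: `build_prompt`, `insert_at`, and the toy vectors are hypothetical names and values, and plain Python lists stand in for the learnable tensors of a real implementation.

```python
from typing import List, Optional

Vector = List[float]


def build_prompt(
    context: List[Vector],
    class_embedding: List[Vector],
    insert_at: Optional[int] = None,
) -> List[Vector]:
    """Place the class-name embedding among the context vectors.

    insert_at=None appends it at the end, matching the layout that the
    Figure 2 caption uses for illustration.
    """
    if insert_at is None:
        insert_at = len(context)
    if not 0 <= insert_at <= len(context):
        raise ValueError("insert_at must lie between 0 and len(context)")
    return context[:insert_at] + class_embedding + context[insert_at:]


# Three toy context vectors and a one-token class-name embedding.
context = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
class_embedding = [[9.0, 9.0]]

prompt_end = build_prompt(context, class_embedding)               # class token last
prompt_mid = build_prompt(context, class_embedding, insert_at=1)  # class token second
```

In a trainable version, only the context vectors would receive gradients, while the class-name embedding would come from the frozen text encoder's token embedding table.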