Adapt PointFormer: 3D Point Cloud Analysis via Adapting 2D Visual Transformers
Mengke Li, Da Li, Guoqing Yang, Yiu-ming Cheung, Hui Huang
TL;DR
This work addresses how to leverage pre-trained 2D visual transformers for 3D point cloud analysis when 3D training data is limited. It introduces Adaptive PointFormer (APF), which keeps a 2D ViT backbone frozen and adds a compact PointFormer module plus a lightweight Point Embed, so that 3D point clouds can be processed directly, without projection to images. A key novelty is sequencing the point embeddings in Morton order to align 3D tokens with 2D attention priors, coupled with a bottleneck PEFT strategy that calibrates cross-domain attention using a small number of learnable parameters. Experiments on ModelNet40, ScanObjectNN, and ShapeNetPart show that APF can outperform several 3D pre-trained methods while reducing training cost, which supports the practicality of transferring 2D priors to 3D data.
Abstract
Pre-trained large-scale models have exhibited remarkable efficacy in computer vision, particularly for 2D image analysis. For 3D point clouds, however, data is far less available than the vast repositories of images, which poses a challenge for developing 3D pre-trained models. This paper therefore attempts to directly leverage pre-trained models with 2D prior knowledge for 3D point cloud analysis tasks. Accordingly, we propose the Adaptive PointFormer (APF), which fine-tunes pre-trained 2D models with only a modest number of parameters to directly process point clouds, obviating the need for mapping to images. Specifically, we convert raw point clouds into point embeddings whose dimensions align with those of image tokens. Because point clouds are inherently unordered, in contrast to the structured nature of images, we then sequence the point embeddings to make better use of 2D attention priors. To calibrate attention across the 3D and 2D domains and reduce computational overhead, a trainable PointFormer with a limited number of parameters is attached to a frozen pre-trained image model. Extensive experiments on various benchmarks demonstrate the effectiveness of the proposed APF. The source code and more details are available at https://vcc.tech/research/2024/PointFormer.
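To make the sequencing step concrete, the following is a minimal sketch of ordering points along a Morton (Z-order) curve, the scheme named in the TL;DR. It is not taken from the authors' code: the function names (`morton_code`, `morton_order`), the 10-bit quantization grid, and the choice to normalize by the largest bounding-box extent are all assumptions made for illustration. The idea is to quantize each coordinate to an integer grid, interleave the bits of x, y, and z into one key, and sort by that key so that points close in 3D tend to be close in the resulting sequence.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def _spread_bits(value: int, bits: int) -> int:
    """Place bit i of `value` at position 3*i, leaving two zero bits between."""
    out = 0
    for i in range(bits):
        out |= ((value >> i) & 1) << (3 * i)
    return out


def morton_code(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of three grid indices into one key.

    Bit layout (least significant first): x0, y0, z0, x1, y1, z1, ...
    """
    return (
        _spread_bits(ix, bits)
        | (_spread_bits(iy, bits) << 1)
        | (_spread_bits(iz, bits) << 2)
    )


def morton_order(points: Sequence[Point], bits: int = 10) -> List[int]:
    """Return the indices of `points` sorted along a Morton curve.

    Assumption: coordinates are quantized to a (2**bits)^3 grid spanning the
    bounding box, using a single scale for all axes to keep the aspect ratio.
    """
    if not points:
        return []
    mins = [min(p[a] for p in points) for a in range(3)]
    maxs = [max(p[a] for p in points) for a in range(3)]
    extent = max(maxs[a] - mins[a] for a in range(3)) or 1.0  # avoid div by 0
    top = (1 << bits) - 1

    def quantize(p: Point) -> Tuple[int, int, int]:
        return tuple(
            min(top, int((p[a] - mins[a]) / extent * top)) for a in range(3)
        )

    codes = [morton_code(*quantize(p), bits=bits) for p in points]
    # sorted() is stable, so points sharing a grid cell keep their input order.
    return sorted(range(len(points)), key=codes.__getitem__)
```

In a pipeline like the one described, the returned index list would be used to permute the point embeddings (for example, the centers of local point groups) before they are fed to the frozen transformer; which points are sorted and at what resolution are details this sketch does not attempt to reproduce.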
