Point Clouds Are Specialized Images: A Knowledge Transfer Approach for 3D Understanding
Jiachen Kang, Wenjing Jia, Xiangjian He, Kin Man Lam
TL;DR
This work addresses data scarcity and the annotation burden in 3D point cloud understanding by reframing point clouds as specialized images and leveraging large-scale image knowledge through a shared Transformer backbone. The proposed PCExpert architecture freezes the image backbone and introduces a point-specific module that shares the Vision Transformer's multi-head attention while using modality-specific FFNs, enabling deep cross-modal knowledge transfer with only a small fraction of the parameters being trainable. The learning objective combines cross-modal and intra-modal contrastive losses with a novel transformation parameter estimation task, delivering state-of-the-art results on ScanObjectNN under LINEAR fine-tuning and strong performance in few-shot and full fine-tuning settings. Overall, PCExpert demonstrates that image-based priors can meaningfully improve 3D understanding, offering a scalable, data-efficient path for multi-modal Transformer-based point cloud learning.
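The parameter-sharing idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `MultiWayBlock`, the single attention head, the ReLU activation, the omitted layer normalization, and the random weights standing in for pre-trained ones are all simplifying assumptions made here for brevity.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


class MultiWayBlock:
    """Hypothetical block: shared self-attention, one FFN per modality."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Shared attention weights. In the described setup these would come
        # from the pre-trained image encoder; random values stand in here.
        self.attn = {n: rand_matrix(dim, dim, rng) for n in ("q", "k", "v", "o")}
        # Modality-specific feed-forward networks.
        self.ffn = {
            "image": (rand_matrix(dim, hidden, rng), rand_matrix(hidden, dim, rng)),
            "point": (rand_matrix(dim, hidden, rng), rand_matrix(hidden, dim, rng)),
        }
        # Assumed training split: only the point-specific FFN is updated.
        self.trainable = {"attn": False, "ffn.image": False, "ffn.point": True}

    def attention(self, x):
        q, k, v = (matmul(x, self.attn[n]) for n in ("q", "k", "v"))
        scale = 1.0 / math.sqrt(len(q[0]))
        scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
        weights = [softmax(r) for r in scores]
        return matmul(matmul(weights, v), self.attn["o"])

    def forward(self, x, modality):
        # Residual connection around the shared attention.
        a = self.attention(x)
        x = [[xi + ai for xi, ai in zip(xr, ar)] for xr, ar in zip(x, a)]
        # Route the tokens through the FFN that matches their modality.
        w1, w2 = self.ffn[modality]
        h = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
        f = matmul(h, w2)
        return [[xi + fi for xi, fi in zip(xr, fr)] for xr, fr in zip(x, f)]
```

Calling `MultiWayBlock(dim=8, hidden=16).forward(tokens, "point")` on a list of 8-dimensional token vectors sends them through the same attention weights an image would use, but through the point-specific FFN. The `trainable` dictionary only records which parameter groups an optimizer would update under the assumed split.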
Abstract
Self-supervised representation learning (SSRL) has gained increasing attention in point cloud understanding as a way to address the challenges posed by 3D data scarcity and high annotation costs. This paper presents PCExpert, a novel SSRL approach that reinterprets point clouds as "specialized images". This conceptual shift allows PCExpert to leverage knowledge derived from the large-scale image modality in a more direct and deeper manner, by extensively sharing parameters with a pre-trained image encoder in a multi-way Transformer architecture. The parameter-sharing strategy, combined with a novel pretext task for pre-training, namely transformation estimation, enables PCExpert to outperform the state of the art in a variety of tasks, with a remarkable reduction in the number of trainable parameters. Notably, PCExpert's performance under LINEAR fine-tuning (e.g., a 90.02% overall accuracy on ScanObjectNN) already approaches the results obtained with FULL model fine-tuning (92.66%), demonstrating its effective and robust representation capability.
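To make the transformation estimation pretext task more concrete, the sketch below generates one self-supervised training pair: a transformed point cloud and the transformation parameters a model would be asked to regress. The abstract does not specify which transformations or parameterization are used, so the z-axis rotation, the uniform scale, the parameter ranges, and the helper names `rotate_z`, `make_pretext_sample`, and `l2_loss` are all assumptions chosen for illustration.

```python
import math
import random


def rotate_z(points, angle):
    """Rotate (x, y, z) points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def make_pretext_sample(points, rng):
    """Return (transformed points, regression target) for one example."""
    angle = rng.uniform(-math.pi, math.pi)  # hypothetical parameter range
    scale = rng.uniform(0.8, 1.2)           # hypothetical parameter range
    rotated = rotate_z(points, angle)
    transformed = [(scale * x, scale * y, scale * z) for x, y, z in rotated]
    # Encode the angle as (sin, cos) so the target stays continuous at +/- pi.
    target = (math.sin(angle), math.cos(angle), scale)
    return transformed, target


def l2_loss(pred, target):
    """Mean squared error between predicted and true parameters."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
```

No labels are needed for this task because the target is produced by the data pipeline itself: a prediction head on top of the encoder would output three numbers, and `l2_loss` compares them with `target`.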
