PointCG: Self-supervised Point Cloud Learning via Joint Completion and Generation
Yun Liu, Peng Li, Xuefeng Yan, Liangliang Nan, Bing Wang, Honghua Chen, Lina Gong, Wei Zhao, Mingqiang Wei
TL;DR
PointCG tackles data-efficient self-supervised learning for 3D point clouds by unifying two complementary pretext tasks: Hidden Point Completion (HPC, with loss $L_{CD}$) and Arbitrary-view Image Generation (AIG, with loss $L_G$). It samples visible points with the Hidden Point Removal operator so that the input does not leak the object's complete structure, uses an asymmetric Transformer encoder–decoder for 3D completion, and performs cross-modal feature alignment with the generated images via a CLIP-based projection to train a discriminative 3D backbone. The authors report improved 3D reconstruction, shape classification, part and semantic segmentation, and indoor object detection on ShapeNet55, ModelNet40, ScanObjectNN, ShapeNetPart, S3DIS, and ScanNetV2, with ablations supporting the contributions of HPC, AIG, and cross-modal alignment. The result is a pretraining paradigm that combines multi-view geometry with cross-modal supervision to yield robust point-cloud representations for 3D understanding.
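The summary above does not spell out $L_{CD}$. Assuming it denotes the symmetric Chamfer distance commonly used for point cloud completion, one common form is given below. The symbols $\hat{P}$ (completed point set) and $P$ (ground-truth point set) are introduced here for illustration, and the squared norm and per-set averaging are assumptions, not details confirmed by the source.

$$
L_{CD}(\hat{P}, P) = \frac{1}{|\hat{P}|} \sum_{x \in \hat{P}} \min_{y \in P} \lVert x - y \rVert_2^2 + \frac{1}{|P|} \sum_{y \in P} \min_{x \in \hat{P}} \lVert x - y \rVert_2^2
$$

The first term penalizes predicted points that lie far from the ground truth, and the second penalizes ground-truth regions that the prediction fails to cover.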
Abstract
The core of self-supervised point cloud learning lies in designing appropriate pretext tasks, so that the resulting pre-training framework enables the encoder to perceive 3D objects effectively. In this paper, we integrate two prevalent methods, masked point modeling (MPM) and 3D-to-2D generation, as pretext tasks within a single pre-training framework. We leverage the spatial awareness and precise supervision offered by these two methods to address their respective limitations: ambiguous supervision signals and insensitivity to geometric information. Specifically, the proposed framework, abbreviated as PointCG, consists of a Hidden Point Completion (HPC) module and an Arbitrary-view Image Generation (AIG) module. We first capture visible points from arbitrary views as inputs by removing hidden points. Then, HPC extracts representations of the inputs with an encoder and completes the entire shape with a decoder, while AIG generates rendered images from the visible points' representations. Extensive experiments demonstrate the superiority of the proposed method over the baselines in various downstream tasks. Our code will be made available upon acceptance.
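The step of capturing visible points from an arbitrary view can be sketched with the Hidden Point Removal operator in its commonly used formulation (spherical flipping followed by a convex hull). This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `hidden_point_removal`, the `radius_scale` parameter, and its default value are hypothetical, and the code relies on NumPy and SciPy's `scipy.spatial.ConvexHull`.

```python
import numpy as np
from scipy.spatial import ConvexHull


def hidden_point_removal(points, viewpoint, radius_scale=100.0):
    """Estimate which points are visible from `viewpoint`.

    points: (N, 3) array; viewpoint: (3,) array that coincides with no point.
    radius_scale is an illustrative knob (assumption): larger values mark
    more points as visible.
    Returns sorted indices into `points`.
    """
    points = np.asarray(points, dtype=float)
    p = points - np.asarray(viewpoint, dtype=float)  # move viewpoint to origin
    norms = np.linalg.norm(p, axis=1, keepdims=True)
    radius = radius_scale * norms.max()
    # Spherical flipping: reflect each point about a sphere of this radius.
    flipped = p + 2.0 * (radius - norms) * (p / norms)
    # Points whose flipped image is a vertex of the convex hull of
    # (flipped points + origin) are treated as visible.
    hull = ConvexHull(np.vstack([flipped, np.zeros((1, 3))]))
    visible = hull.vertices[hull.vertices < len(points)]  # drop the origin
    return np.sort(visible)


# Usage: points on a unit sphere seen from a camera on the +z axis.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 3))
cloud /= np.linalg.norm(cloud, axis=1, keepdims=True)
visible_idx = hidden_point_removal(cloud, viewpoint=np.array([0.0, 0.0, 5.0]))
partial_input = cloud[visible_idx]  # partial cloud fed to the encoder in this sketch
```

In this sketch, `partial_input` plays the role of the visible points passed to the encoder, while the full `cloud` would serve as the completion target. Unlike random masking, the retained points depend on the chosen viewpoint, so the input carries no samples from the occluded side of the object.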
