PointCG: Self-supervised Point Cloud Learning via Joint Completion and Generation

Yun Liu, Peng Li, Xuefeng Yan, Liangliang Nan, Bing Wang, Honghua Chen, Lina Gong, Wei Zhao, Mingqiang Wei

TL;DR

PointCG tackles data-efficient self-supervised learning for 3D point clouds by unifying two complementary pretext tasks: hidden point completion ($L_{CD}$) and arbitrary-view image generation ($L_G$). It samples visible points with the Hidden Point Removal operator to avoid leaking the object's complete structure, uses an asymmetric Transformer encoder–decoder for 3D completion, and performs cross-modal feature alignment with generated images via a CLIP-based projection to train a discriminative 3D backbone. The framework demonstrates improved 3D reconstruction, shape classification, part and semantic segmentation, and indoor object detection across ShapeNet55, ModelNet40, ScanObjectNN, ShapeNetPart, and S3DIS/ScanNetV2 datasets, with ablations validating the contributions of HPC, AIG, and cross-modal alignment. This approach provides a versatile pretraining paradigm that leverages multi-view geometry and cross-modal supervision to yield robust point-cloud representations for comprehensive 3D understanding.

Abstract

The core of self-supervised point cloud learning lies in designing appropriate pretext tasks to construct a pre-training framework that enables the encoder to perceive 3D objects effectively. In this paper, we integrate two prevalent methods, masked point modeling (MPM) and 3D-to-2D generation, as pretext tasks within a pre-training framework. We leverage the spatial awareness and precise supervision offered by these two methods to address their respective limitations: ambiguous supervision signals and insensitivity to geometric information. Specifically, the proposed framework, abbreviated as PointCG, consists of a Hidden Point Completion (HPC) module and an Arbitrary-view Image Generation (AIG) module. We first capture visible points from arbitrary views as inputs by removing hidden points. Then, HPC extracts representations of the inputs with an encoder and completes the entire shape with a decoder, while AIG generates rendered images based on the visible points' representations. Extensive experiments demonstrate the superiority of the proposed method over the baselines in various downstream tasks. Our code will be made available upon acceptance.
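
The completion objective is denoted $L_{CD}$ in the TL;DR, and Chamfer Distance (CD) is the reconstruction metric named in Figure 1. As a rough illustration of what such a term measures, here is a minimal pure-Python sketch of a symmetric Chamfer Distance between a predicted and a target point cloud. The squared-Euclidean, mean-reduced form and the function name `chamfer_distance` are assumptions made for this example; the summary does not state which CD variant the paper uses.

```python
def chamfer_distance(pred, target):
    """Symmetric Chamfer Distance between two point clouds.

    Assumed variant: squared Euclidean distance, averaged per cloud.
    `pred` and `target` are non-empty lists of equal-dimension tuples.
    """
    def sq_dist(p, q):
        # Squared Euclidean distance between two points.
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_sided(src, dst):
        # Mean over src of the squared distance to the nearest point in dst.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(pred, target) + one_sided(target, pred)


# Usage: a completed shape compared against its ground truth.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
completed = [(0.0, 0.0, 0.0)]
loss = chamfer_distance(completed, ground_truth)
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for illustration; a practical training loop would typically use a batched tensor implementation instead.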

Paper Structure

This paper contains 18 sections, 9 equations, 11 figures, 14 tables.

Figures (11)

  • Figure 1: Qualitative and quantitative comparison of models using different pretext tasks. Chamfer Distance (CD) and Structural Similarity Index (SSIM) are employed as the quantitative metrics. For the masked point modeling (MPM) task, we utilize the method proposed in Point-MAE (Pang et al., 2022) with the inputs of visible points from arbitrary views (see Sec. \ref{sec:data_Org}). For the 3D-to-2D generation task, we define the pretext task as generating images from arbitrary views. The result of the model using only MPM exhibits group clustering at the edges, while our method yields sharp, clear edges that closely align with the ground truth. The model relying solely on 3D-to-2D generation fails to capture three-dimensional structural information, while our method effectively preserves the geometric structure. Directly combining both tasks (Direct Combination) generates point clouds and images superior to those from MPM or 3D-to-2D generation alone, but with lower Linear-SVM accuracy.
  • Figure 2: Visualization of the unmasked points (a), the masked points (b), the completed point cloud composed of green unmasked points and gray masked points (c), and the completed point cloud in blue with overlapping points highlighted in red (d).
  • Figure 3: Overview of PointCG. PointCG integrates two prevalent methods, masked point modeling (MPM) and 3D-to-2D generation, as pretext tasks within a pre-training framework. In detail, we first capture visible points with the Hidden Point Removal (HPR) operator (Katz et al., 2007). Then we utilize an encoder-decoder architecture to extract features from these visible points and complete the original point clouds through the hidden point completion (HPC) module. The arbitrary-view image generation (AIG) module generates images based on the aligned representations of visible points. Note that the input images for feature alignment are randomly selected and do not need to match the target images of image generation.
  • Figure 4: Visualization of the original points $P$ in blue, points after spherical flipping $\hat{P}$ in light blue, and visible points from $C$ in magenta.
  • Figure 5: The image generator consists of several deconvolutional residual blocks, generating the image from the view of $L_C$.
  • ...and 6 more figures
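
The captions for Figures 3 and 4 refer to the HPR operator and its spherical-flipping step. The following is a minimal sketch of how that operator works, not the paper's implementation. It assumes $C$ is the viewpoint (as in Katz et al.), works in 2D rather than 3D so that a short pure-Python convex hull suffices, and uses hypothetical names (`spherical_flip`, `convex_hull_indices`, `hidden_point_removal`) and an illustrative `radius` value. Each point is translated so that $C$ is the origin, flipped via $\hat{p} = p + 2(R - \|p\|)\,p/\|p\|$, and then marked visible if it lies on the convex hull of $\hat{P} \cup \{C\}$.

```python
import math


def spherical_flip(points, viewpoint, radius):
    """Move the viewpoint to the origin, then flip each point about a
    sphere of the given radius: p_hat = p + 2 * (R - |p|) * p / |p|.
    Assumes no point coincides with the viewpoint."""
    flipped = []
    for x, y in points:
        px, py = x - viewpoint[0], y - viewpoint[1]
        norm = math.hypot(px, py)
        scale = 1.0 + 2.0 * (radius - norm) / norm
        flipped.append((px * scale, py * scale))
    return flipped


def convex_hull_indices(pts):
    """Indices of the convex hull vertices (Andrew's monotone chain)."""
    order = sorted(range(len(pts)), key=lambda i: pts[i])

    def cross(o, a, b):
        return ((pts[a][0] - pts[o][0]) * (pts[b][1] - pts[o][1])
                - (pts[a][1] - pts[o][1]) * (pts[b][0] - pts[o][0]))

    def half_hull(seq):
        chain = []
        for i in seq:
            # Drop points that would make a clockwise or collinear turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], i) <= 0:
                chain.pop()
            chain.append(i)
        return chain

    lower = half_hull(order)
    upper = half_hull(reversed(order))
    return set(lower[:-1] + upper[:-1])


def hidden_point_removal(points, viewpoint, radius):
    """Indices of points judged visible from the viewpoint."""
    flipped = spherical_flip(points, viewpoint, radius)
    # The viewpoint itself (now the origin) is appended before taking the hull.
    hull = convex_hull_indices(flipped + [(0.0, 0.0)])
    return sorted(i for i in hull if i < len(points))


# Usage: 16 samples on a unit circle, viewed from (5, 0).
circle = [(math.cos(2 * math.pi * k / 16), math.sin(2 * math.pi * k / 16))
          for k in range(16)]
visible = hidden_point_removal(circle, viewpoint=(5.0, 0.0), radius=100.0)
```

In this toy setup the operator keeps the samples on the side of the circle facing the viewer and discards those on the far side. The result is sensitive to `radius`: with sparse sampling, a very large radius tends to mark occluded points as visible, so the value used here is only an illustrative choice.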