PointSmile: Point Self-supervised Learning via Curriculum Mutual Information

Xin Li, Mingqiang Wei, Songcan Chen

TL;DR

PointSmile tackles the challenge of learning transferable 3D point-cloud representations without labels by marrying curriculum learning with mutual information maximization in a decoder-free framework. It introduces Curriculum Data Augmentation to create easy-to-hard replicas and jointly optimizes feature-wise and class-wise mutual information to shape a robust, well-distributed latent space. Across synthetic and real-world benchmarks, PointSmile achieves state-of-the-art or competitive results on object classification and segmentation with standard backbones like PointNet and DGCNN, demonstrating strong generalization. This approach provides a scalable, reconstruction-free alternative to traditional self-supervised methods for 3D understanding with practical impact on downstream tasks.

Abstract

Self-supervised learning is attracting wide attention in point cloud processing. However, learning discriminative and transferable point cloud features that support efficient training on downstream tasks remains an open problem, owing to the natural sparsity and irregularity of point clouds. We propose PointSmile, a reconstruction-free self-supervised learning paradigm that maximizes curriculum mutual information (CMI) across the replicas of point cloud objects. From the perspective of how-and-what-to-learn, PointSmile is designed to imitate human curriculum learning, i.e., starting with an easy curriculum and gradually increasing its difficulty. To solve "how-to-learn", we introduce curriculum data augmentation (CDA) of point clouds. CDA encourages PointSmile to learn from easy samples to hard ones, so that the latent space is shaped dynamically to produce better embeddings. To solve "what-to-learn", we propose to maximize both feature-wise and class-wise CMI, to better extract discriminative features of point clouds. Unlike most existing methods, PointSmile requires neither a pretext task nor cross-modal data to yield rich latent representations. We demonstrate the effectiveness and robustness of PointSmile on downstream tasks including object classification and segmentation. Extensive results show that PointSmile outperforms existing self-supervised methods and compares favorably with popular fully-supervised methods on various standard architectures.
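To make the "what-to-learn" objective more concrete, the following is a minimal NumPy sketch of two mutual information estimates between a pair of replicas. It is an illustration under assumptions, not the paper's implementation: the feature-wise term uses the standard InfoNCE lower bound, the class-wise term uses the mutual information of the empirical joint over soft class assignments, and the function names `info_nce_bound` and `class_mutual_information` as well as the `temperature` default are hypothetical.

```python
import numpy as np

def info_nce_bound(za, zb, temperature=0.1):
    """InfoNCE lower bound on I(za; zb) for paired embeddings of shape (N, D).

    Assumption: row i of za and row i of zb come from the same object.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature                      # (N, N) scaled cosine similarity
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; the bound is log N plus their mean log-probability.
    return float(np.log(len(za)) + np.mean(np.diag(log_softmax)))

def class_mutual_information(pa, pb, eps=1e-12):
    """MI between soft class assignments pa, pb of shape (N, C), rows summing to 1."""
    joint = pa.T @ pb / len(pa)          # (C, C) empirical joint distribution
    joint = (joint + joint.T) / 2.0      # symmetrize: the two replicas are exchangeable
    joint = joint / joint.sum()
    marg_a = joint.sum(axis=1, keepdims=True)
    marg_b = joint.sum(axis=0, keepdims=True)
    return float(np.sum(joint * (np.log(joint + eps)
                                 - np.log(marg_a + eps)
                                 - np.log(marg_b + eps))))
```

A training loss in this spirit would minimize the negative sum of the two terms. The InfoNCE bound cannot exceed log N for a batch of N objects, and the class-wise term cannot exceed log C for C classes.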

Paper Structure (24 sections, 11 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Self-supervised learning is challenging for point clouds. Benefiting from curriculum mutual information, our single-modal PointSmile shows clear improvements over the cross-modal CrossPoint [DBLP:conf/cvpr/AfhamDDDTR22] and is comparable to the supervised methods [DBLP:conf/cvpr/QiSMG17, DBLP:journals/tog/WangSLSBS19] on various downstream tasks. Left: the accuracy of linear SVM classification on ModelNet40 [DBLP:conf/cvpr/WuSKYZTX15] and ScanObjectNN [DBLP:conf/iccv/UyPHNY19]. Right: the mean IoU of part segmentation (top) and semantic segmentation (bottom) on ShapeNetPart [DBLP:journals/tog/YiKCSYSLHSG16] and S3DIS [armeni20163d]. Two backbone networks, PointNet [DBLP:conf/cvpr/QiSMG17] and DGCNN [DBLP:journals/tog/WangSLSBS19], are used.
  • Figure 2: Overview of PointSmile. PointSmile is designed to imitate how humans learn professional knowledge via curriculum learning. It consists of three main components: i) curriculum data augmentation (CDA), which constructs easy and hard replicas of each 3D object and gradually increases the proportion of hard replicas during learning; ii) a shared encoder $E$, which learns geometric representations; and iii) two curriculum mutual information (CMI) modules, which jointly maximize the feature-wise and class-wise CMI. $x$ denotes an input point cloud batch, and $x^{a}$, $x^{b}$ denote two different replicas of $x$ obtained from CDA.
  • Figure 3: Comparison of different ways to obtain augmented samples. (a) An example comparison of three different regimes for the augmented sample pairs in an image. (b) Comparison of easy augmented sample (easy sample) and hard augmented sample (hard sample). We show the geometric transformation of the sample with a uniform xyz coordinate axis. (c) Comparison of SDA and CDA. The length of the rectangle represents the number of augmented samples with the corresponding color. Please note that $\lambda$ is an array that grows over the epochs.
  • Figure 4: t-SNE visualization of the features learned from ModelNet10 with PointNet as the backbone. The features learned by maximizing both feature-wise and class-wise CMI (right) discriminate classes better than those learned with only feature-wise CMI (left) or only class-wise CMI (middle).
  • Figure 5: Part segmentation results on ShapeNetPart for CrossPoint [DBLP:conf/cvpr/AfhamDDDTR22] and PointSmile (with DGCNN as the encoder). Different colors represent different parts.
  • ...and 2 more figures
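
The easy-to-hard schedule described in the Figure 2 and Figure 3 captions can be sketched as follows. This is a rough illustration under stated assumptions rather than the authors' procedure: the linear schedule, the specific transformations (jitter for easy replicas; rotation about the z-axis, anisotropic scaling and stronger jitter for hard replicas), their magnitudes, and all function names are hypothetical. The captions describe $\lambda$ as an array that grows over the epochs; here it is reduced to a single hard-replica probability per epoch.

```python
import numpy as np

def hard_fraction(epoch, total_epochs, start=0.0, end=1.0):
    """Linear schedule (an assumption): share of hard replicas at a given epoch."""
    t = epoch / max(total_epochs - 1, 1)
    return start + (end - start) * min(max(t, 0.0), 1.0)

def easy_augment(points, rng):
    """Mild perturbation: small Gaussian jitter (hypothetical choice)."""
    return points + rng.normal(scale=0.01, size=points.shape)

def hard_augment(points, rng):
    """Stronger perturbation: random z-rotation, anisotropic scaling, jitter."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    scale = rng.uniform(0.7, 1.3, size=(1, 3))
    return (points @ rot.T) * scale + rng.normal(scale=0.02, size=points.shape)

def curriculum_replicas(batch, epoch, total_epochs, rng):
    """Return two replicas (x_a, x_b) of a batch of shape (B, N, 3).

    Each object receives a hard replica with probability hard_fraction(epoch).
    """
    lam = hard_fraction(epoch, total_epochs)

    def replica():
        use_hard = rng.random(len(batch)) < lam
        return np.stack([hard_augment(p, rng) if h else easy_augment(p, rng)
                         for p, h in zip(batch, use_hard)])

    return replica(), replica()
```

With these defaults, the first epoch produces only easy replicas and the last epoch only hard ones. Both replicas would then be passed through the shared encoder $E$.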