PointSmile: Point Self-supervised Learning via Curriculum Mutual Information
Xin Li, Mingqiang Wei, Songcan Chen
TL;DR
PointSmile tackles the challenge of learning transferable 3D point-cloud representations without labels by marrying curriculum learning with mutual information maximization in a decoder-free framework. It introduces Curriculum Data Augmentation to create easy-to-hard replicas and jointly optimizes feature-wise and class-wise mutual information to shape a robust, well-distributed latent space. Across synthetic and real-world benchmarks, PointSmile achieves state-of-the-art or competitive results on object classification and segmentation with standard backbones like PointNet and DGCNN, demonstrating strong generalization. This approach provides a scalable, reconstruction-free alternative to traditional self-supervised methods for 3D understanding with practical impact on downstream tasks.
Abstract
Self-supervised learning is attracting wide attention in point cloud processing. However, learning discriminative and transferable point cloud features that enable efficient training on downstream tasks remains an open problem, owing to the natural sparsity and irregularity of point clouds. We propose PointSmile, a reconstruction-free self-supervised learning paradigm that maximizes curriculum mutual information (CMI) across replicas of point cloud objects. From the perspective of how-and-what-to-learn, PointSmile is designed to imitate human curriculum learning, i.e., starting with an easy curriculum and gradually increasing its difficulty. To address "how-to-learn", we introduce curriculum data augmentation (CDA) of point clouds. CDA encourages PointSmile to learn from easy samples to hard ones, so that the latent space is dynamically shaped to produce better embeddings. To address "what-to-learn", we propose to maximize both feature-wise and class-wise CMI to better extract discriminative features of point clouds. Unlike most existing methods, PointSmile requires neither a pretext task nor cross-modal data to yield rich latent representations. We demonstrate the effectiveness and robustness of PointSmile on downstream tasks including object classification and segmentation. Extensive results show that PointSmile outperforms existing self-supervised methods and compares favorably with popular fully-supervised methods on various standard architectures.
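The easy-to-hard idea behind CDA and the class-wise mutual information objective can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the linear schedule, the z-axis rotation and jitter magnitudes, and the function names curriculum_strength, make_replica, and class_wise_mi are all hypothetical. The mutual information estimate is the standard joint-distribution estimator over soft class assignments (as used in Invariant Information Clustering), swapped in here because the abstract does not specify the exact estimator.

```python
import numpy as np

def curriculum_strength(epoch, total_epochs, max_strength=1.0):
    """Assumed linear easy-to-hard schedule: 0 at the first epoch, max_strength at the last."""
    if total_epochs <= 1:
        return max_strength
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return max_strength * progress

def make_replica(points, strength, rng):
    """Return an augmented copy of an (N, 3) point cloud; larger strength gives a harder replica."""
    # Rotation about the z-axis; the angle range grows with strength (assumed augmentation).
    angle = rng.uniform(-np.pi, np.pi) * strength
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Gaussian jitter whose standard deviation grows with strength (0.02 is an arbitrary choice).
    jitter = rng.normal(0.0, 0.02 * strength, size=points.shape)
    return points @ rot.T + jitter

def class_wise_mi(p_a, p_b, eps=1e-8):
    """Mutual information between soft class assignments of two replicas, each of shape (B, C)."""
    joint = p_a.T @ p_b / p_a.shape[0]      # empirical joint distribution over class pairs, (C, C)
    joint = (joint + joint.T) / 2.0         # symmetrize, since replica order is arbitrary
    joint = np.clip(joint, eps, None)       # avoid log(0)
    joint = joint / joint.sum()
    marg_a = joint.sum(axis=1, keepdims=True)
    marg_b = joint.sum(axis=0, keepdims=True)
    return float(np.sum(joint * (np.log(joint) - np.log(marg_a) - np.log(marg_b))))

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))
for epoch in (0, 50, 99):
    strength = curriculum_strength(epoch, total_epochs=100)
    replica = make_replica(cloud, strength, rng)
    print(epoch, round(strength, 3), float(np.abs(replica - cloud).mean()))
```

In a training loop, two replicas of each object would be encoded by the backbone, and a quantity such as class_wise_mi (together with a feature-wise counterpart) would be maximized; the schedule then controls how far apart the replicas are allowed to drift as training progresses.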
