Point-JEPA: A Joint Embedding Predictive Architecture for Self-Supervised Learning on Point Cloud
Ayumu Saito, Prachi Kudeshia, Jiju Poovvancheri
TL;DR
Point-JEPA is a self-supervised learning framework for point clouds that applies the Joint Embedding Predictive Architecture (JEPA) in embedding space rather than on raw inputs, using a greedy sequencer to keep target and context blocks spatially coherent. Because the method predicts target embeddings from context embeddings, it avoids reconstruction in the input space and reduces reliance on extra modalities, giving faster pre-training and strong downstream performance. With pre-training on ShapeNet, it reaches competitive linear-evaluation and few-shot accuracy, and ablations support the choice of masking strategy, sequencer design, and number of target blocks. The approach offers an efficient route to learning point-cloud representations, with potential extensions to detection and scene understanding.
Abstract
Recent advancements in self-supervised learning in the point cloud domain have demonstrated significant potential. However, these methods often suffer from drawbacks, including lengthy pre-training time, the necessity of reconstruction in the input space, or the necessity of additional modalities. To address these issues, we introduce Point-JEPA, a joint embedding predictive architecture designed specifically for point cloud data. To this end, we introduce a sequencer that orders point cloud patch embeddings so that their proximity can be computed efficiently and used, via their indices, during target and context selection. The sequencer also allows the proximity computation for the patch embeddings to be shared between context and target selection, further improving efficiency. Experimentally, our method achieves results competitive with state-of-the-art methods while avoiding reconstruction in the input space and additional modalities.
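The idea of ordering patches so that nearby indices correspond to nearby patches can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `greedy_sequence` and `contiguous_block`, the use of patch center coordinates as the proximity measure, and the start rule (the center with the smallest coordinate sum) are all assumptions made for illustration.

```python
import numpy as np


def greedy_sequence(centers: np.ndarray) -> np.ndarray:
    """Order patch centers so each one is followed by its nearest unvisited neighbour.

    centers: array of shape (num_patches, 3) holding hypothetical patch centers.
    Returns an integer array of patch indices in sequence order.
    """
    n = centers.shape[0]
    # Pairwise Euclidean distances, computed once and reused at every step.
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    visited = np.zeros(n, dtype=bool)
    order = np.empty(n, dtype=np.int64)
    # Assumed start rule: the center with the smallest coordinate sum.
    current = int(np.argmin(centers.sum(axis=1)))
    for step in range(n):
        order[step] = current
        visited[current] = True
        if step == n - 1:
            break
        # Mask visited centers so they cannot be selected again.
        candidates = np.where(visited, np.inf, dists[current])
        current = int(np.argmin(candidates))
    return order


def contiguous_block(order: np.ndarray, start: int, length: int) -> np.ndarray:
    """Pick a block of spatially nearby patches as a slice of the sequence."""
    return order[start:start + length]
```

Once the sequence is built, both target and context blocks can be drawn as index ranges over the same ordering, which is one way to read the abstract's point about sharing the proximity computation between the two selections.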
