Point-JEPA: A Joint Embedding Predictive Architecture for Self-Supervised Learning on Point Cloud

Ayumu Saito, Prachi Kudeshia, Jiju Poovvancheri

TL;DR

Point-JEPA is a self-supervised learning framework for point clouds that applies a Joint Embedding Predictive Architecture (JEPA) to patch embeddings rather than raw inputs, using a greedy sequencer to keep target and context blocks spatially coherent. Because it predicts target embeddings from context embeddings in embedding space, the method avoids reconstruction in the input space and reduces reliance on extra modalities, achieving faster pre-training and strong downstream performance. With ShapeNet pre-training, it reaches competitive linear-evaluation and few-shot accuracy, and ablations demonstrate the effectiveness of the masking strategies, the sequencer design, and the number of target blocks. The approach provides a scalable, efficient pathway for learning robust point-cloud representations, with potential extensions to detection and scene understanding.

Abstract

Recent advancements in self-supervised learning in the point cloud domain have demonstrated significant potential. However, these methods often suffer from drawbacks, including lengthy pre-training time, the necessity of reconstruction in the input space, or the necessity of additional modalities. In order to address these issues, we introduce Point-JEPA, a joint embedding predictive architecture designed specifically for point cloud data. To this end, we introduce a sequencer that orders point cloud patch embeddings to efficiently compute and utilize their proximity based on the indices during target and context selection. The sequencer also allows shared computations of the patch embeddings' proximity between context and target selection, further improving the efficiency. Experimentally, our method achieves competitive results with state-of-the-art methods while avoiding the reconstruction in the input space or additional modality.
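
The sequencer idea in the abstract can be illustrated with a minimal sketch. The abstract only says that patch embeddings are ordered so that proximity can be read from indices; the specific rule used here (start from the patch center with the smallest coordinate sum, then repeatedly append the nearest unvisited center) is an assumption for illustration, as are the function names `greedy_sequence` and `select_block` and the choice to drop target patches from the context. The ratio ranges in the usage lines are the ones quoted in the Figure 3 caption.

```python
import math
import random


def greedy_sequence(centers):
    """Order patch centers so that neighbors in the sequence are close in space.

    Assumed rule (not stated in the abstract): start from the center with the
    smallest coordinate sum, then repeatedly append the nearest unvisited one.
    Returns a list of indices into `centers`.
    """
    if not centers:
        return []
    remaining = set(range(len(centers)))
    current = min(remaining, key=lambda i: (sum(centers[i]), i))
    order = [current]
    remaining.remove(current)
    while remaining:
        # Ties are broken by index so the result is deterministic.
        current = min(
            remaining,
            key=lambda i: (math.dist(centers[current], centers[i]), i),
        )
        order.append(current)
        remaining.remove(current)
    return order


def select_block(order, ratio_range, rng, exclude=frozenset()):
    """Pick one contiguous run of the sequence as a block of patch indices.

    The block size is a random fraction of the sequence length drawn from
    `ratio_range`. Indices listed in `exclude` are dropped afterwards
    (hypothetical handling, used here to keep targets out of the context).
    """
    n = len(order)
    size = max(1, round(rng.uniform(*ratio_range) * n))
    start = rng.randrange(0, n - size + 1)
    return [i for i in order[start:start + size] if i not in exclude]


# Usage: 64 random patch centers, one target block and one context block.
rng = random.Random(0)
centers = [(rng.random(), rng.random(), rng.random()) for _ in range(64)]
order = greedy_sequence(centers)
target = select_block(order, (0.15, 0.2), rng)
context = select_block(order, (0.4, 0.75), rng, exclude=set(target))
```

Because the ordering is computed once per point cloud, both target and context selection reduce to slicing the same index sequence, which is one way to read the abstract's remark about shared proximity computation.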

Paper Structure (35 sections, 2 equations, 5 figures, 10 tables, 1 algorithm)

This paper contains 35 sections, 2 equations, 5 figures, 10 tables, 1 algorithm.

Figures (5)

  • Figure 1: ModelNet40 Linear Evaluation. Pre-training time on an NVIDIA RTX A5500 and overall accuracy with a linear SVM classifier on ModelNet40. We compare Point-JEPA with previous methods that use the standard Transformer architecture.
  • Figure 2: Schematic renderings illustrating the process of creating embeddings (top left), the point encoder (bottom left), and Point-JEPA (right). Point cloud patches are generated using furthest point sampling (FPS) and $k$-nearest neighbor (KNN) methods; a mini PointNet (Point Encoder) generates patch embeddings, which are subsequently fed to the JEPA architecture. We use the standard Transformer architecture for the context ($f_{\theta}$) and target ($f_{\overline{\theta}}$) encoders as well as the predictor ($p_{\phi}$).
  • Figure 3: Context and Targets. We visualize the corresponding grouped points of context and target blocks. Here, we use (0.15, 0.2) for the target selection ratio and (0.4, 0.75) for the context selection ratio.
  • Figure 4: Embedding Visualization on ModelNet40. We visualize the context encoder's learned representation with t-SNE.
  • Figure 5: Confusion matrices illustrating model performance on ModelNet40 and another dataset, highlighting class-specific accuracies and challenges with similar categories.
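
The patch-generation step named in the Figure 2 caption (FPS to pick patch centers, then KNN to group points around each center) can be sketched in plain Python. This is a generic, unoptimized illustration of the two techniques and not the authors' implementation; the function names, the `start` parameter, and the toy point cloud are assumptions made for the example.

```python
import math
import random


def farthest_point_sampling(points, num_centers, start=0):
    """Return indices of `num_centers` points chosen by furthest point sampling.

    Each new center is the point farthest from all centers chosen so far.
    """
    if not 1 <= num_centers <= len(points):
        raise ValueError("num_centers must be between 1 and len(points)")
    chosen = [start]
    # min_dist[i] is the distance from point i to its closest chosen center.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_centers:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def knn_patches(points, center_ids, k):
    """Group each center with its k nearest points (the center included)."""
    patches = []
    for c in center_ids:
        by_distance = sorted(
            range(len(points)),
            key=lambda i: (math.dist(points[i], points[c]), i),
        )
        patches.append(by_distance[:k])
    return patches


# Usage: 256 random points, 16 patch centers, 8 points per patch.
rng = random.Random(0)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(256)]
center_ids = farthest_point_sampling(cloud, num_centers=16)
patches = knn_patches(cloud, center_ids, k=8)
```

Each entry of `patches` is a list of point indices; in the pipeline the caption describes, such a group would then be passed to the point encoder to produce one patch embedding.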