CrossJEPA: Cross-Modal Joint-Embedding Predictive Architecture for Efficient 3D Representation Learning from 2D Images

Avishka Perera; Kumal Hewagamage; Saeedha Nazar; Kavishka Abeywardana; Hasitha Gallella; Ranga Rodrigo; Mohamed Afham

TL;DR

CrossJEPA introduces a lightweight, cross-modal Joint-Embedding Predictive Architecture that transfers knowledge from a frozen 2D image foundation model to 3D point clouds without relying on masking. It predicts rendered-view image embeddings from 3D points, conditions the predictor on pose information in latent space, and caches the teacher's target embeddings once. With 14.1M pretraining parameters and about 6 pretraining hours on a single GPU, it needs far fewer parameters and training hours than prior cross-modal SSL methods. The approach reports state-of-the-art linear probing on the synthetic ModelNet40 (94.2%) and the real-world ScanObjectNN (88.3%) benchmarks, while offering substantial data efficiency and practical training speedups. The work also provides information-theoretic and predictive-coding justifications for its design and establishes a foundation for efficient, resource-conscious 3D representation learning via cross-modal knowledge distillation.

Abstract

Image-to-point cross-modal learning has emerged to address the scarcity of large-scale 3D datasets in 3D representation learning. However, current methods that leverage 2D data often result in large, slow-to-train models, making them computationally expensive and difficult to deploy in resource-constrained environments. The architecture design of such models is therefore critical, determining their performance, memory footprint, and compute efficiency. The Joint-Embedding Predictive Architecture (JEPA) has gained wide popularity in self-supervised learning for its simplicity and efficiency, but has been under-explored in cross-modal settings, partly due to the misconception that masking is intrinsic to JEPA. In this light, we propose CrossJEPA, a simple Cross-Modal Joint-Embedding Predictive Architecture that harnesses the knowledge of an image foundation model and trains a predictor to infer embeddings of specific rendered 2D views from corresponding 3D point clouds, thereby introducing a JEPA-style pretraining strategy beyond masking. By conditioning the predictor on cross-domain projection information, CrossJEPA filters out of the supervision signal the semantics that are exclusive to the target domain. We further exploit the frozen-teacher design with a one-time target embedding caching mechanism, yielding amortized efficiency. CrossJEPA achieves a new state of the art in linear probing on the synthetic ModelNet40 (94.2%) and the real-world ScanObjectNN (88.3%) benchmarks, using only 14.1M pretraining parameters (8.5M in the point encoder) and about 6 pretraining hours on a standard single GPU. These results position CrossJEPA as a performant, memory-efficient, and fast-to-train framework for 3D representation learning via knowledge distillation. We analyze CrossJEPA intuitively, theoretically, and empirically, and extensively ablate our design choices. Code will be made available.
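
The one-time target embedding caching described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the class `CachedTeacher`, the functions `toy_teacher`, `zero_predictor`, and `run_epochs`, and the `(object id, view index)` cache key are all hypothetical names chosen for illustration, and the toy teacher merely stands in for a frozen image foundation model applied to a rendered view. The point it shows is that a frozen teacher produces the same target for a given view in every epoch, so each target only needs to be computed once.

```python
from typing import Callable, Dict, List, Tuple

Embedding = List[float]
ViewKey = Tuple[str, int]  # hypothetical key: (object id, rendered view index)


class CachedTeacher:
    """Wraps a frozen teacher so each (object, view) target is computed once."""

    def __init__(self, teacher: Callable[[str, int], Embedding]) -> None:
        self._teacher = teacher
        self._cache: Dict[ViewKey, Embedding] = {}
        self.teacher_calls = 0  # counts real teacher forward passes

    def target(self, object_id: str, view: int) -> Embedding:
        key = (object_id, view)
        if key not in self._cache:
            # Cache miss: run the (expensive) frozen teacher exactly once.
            self._cache[key] = self._teacher(object_id, view)
            self.teacher_calls += 1
        return self._cache[key]


def toy_teacher(object_id: str, view: int) -> Embedding:
    # Assumption: deterministic stand-in for an image model on a rendered view.
    seed = sum(ord(c) for c in object_id) + view
    return [float((seed * k) % 7) for k in (1, 2, 3)]


def zero_predictor(object_id: str, view: int) -> Embedding:
    # Assumption: placeholder for a point-cloud student conditioned on the pose.
    return [0.0, 0.0, 0.0]


def mse(a: Embedding, b: Embedding) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def run_epochs(
    teacher: CachedTeacher,
    predict: Callable[[str, int], Embedding],
    objects: List[str],
    n_views: int,
    epochs: int,
) -> float:
    """Accumulates the prediction loss; targets come from the cache."""
    total = 0.0
    for _ in range(epochs):
        for obj in objects:
            for view in range(n_views):
                total += mse(predict(obj, view), teacher.target(obj, view))
    return total


if __name__ == "__main__":
    cached = CachedTeacher(toy_teacher)
    loss = run_epochs(cached, zero_predictor, ["chair", "table"], n_views=4, epochs=3)
    print("teacher forward passes:", cached.teacher_calls)
    print("accumulated loss:", loss)
```

With 2 objects, 4 views, and 3 epochs, the loop requests 24 targets but the teacher runs only 2 × 4 = 8 times. A real pipeline would more likely store the cached embeddings as tensors on disk; the in-memory dictionary here is only the simplest way to show the amortization.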
