LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

Randall Balestriero, Yann LeCun

TL;DR

This paper derives a principled theory for Joint-Embedding Predictive Architectures (JEPAs), showing that the optimal embedding distribution for minimizing downstream risk is an isotropic Gaussian. It then introduces Sketched Isotropic Gaussian Regularization (SIGReg), a scalable, differentiable distribution-matching objective based on directional tests (notably the Epps-Pulley characteristic-function test) that avoids the pitfalls of prior heuristics. By combining SIGReg with the JEPA predictive loss, the authors propose LeJEPA, a simple, hyperparameter-light framework that eliminates collapse, scales to large architectures, and remains robust across domains. Empirically, LeJEPA delivers strong in-domain and cross-domain performance, including competitive ImageNet-1K results with large ViT backbones and superior in-domain pretraining on Galaxy10, while revealing emergent semantic structure in learned representations. The work offers a theory-driven, practical SSL paradigm that reduces reliance on heuristics and enables reliable, scalable self-supervised pretraining for foundation models.

Abstract

Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but a lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in LeJEPA, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs' embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective, Sketched Isotropic Gaussian Regularization (SIGReg), to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA with numerous theoretical and practical benefits: (i) a single trade-off hyperparameter, (ii) linear time and memory complexity, (iii) stability across hyper-parameters, architectures (ResNets, ViTs, ConvNets) and domains, (iv) heuristics-free training, e.g., no stop-gradient, no teacher-student, no hyper-parameter schedulers, and (v) a distributed training-friendly implementation requiring only $\approx$50 lines of code. Our empirical validation covers 10+ datasets and 60+ architectures, all with varying scales and domains. As an example, using ImageNet-1k for pretraining and linear evaluation with a frozen backbone, LeJEPA reaches 79% with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (GitHub repo: https://github.com/rbalestr-lab/lejepa).
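
To make the SIGReg idea concrete, here is a minimal NumPy sketch of a sliced, characteristic-function-based statistic: embeddings are projected onto random unit directions, and each one-dimensional projection is compared against the standard Gaussian characteristic function. It is an illustration under stated assumptions, not the authors' implementation: the function name `sigreg_sketch`, the number of directions, the integration grid, and the Gaussian weighting window are all hypothetical choices, and the code is not differentiable and omits the JEPA predictive loss.

```python
import numpy as np

def sigreg_sketch(z, num_directions=32, num_knots=17, t_max=5.0, seed=None):
    """Average Epps-Pulley-style statistic over random 1-D projections.

    z: array of shape (N, D) holding N embeddings of dimension D.
    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = z.shape
    # Sample random directions and normalize them to unit length.
    directions = rng.standard_normal((d, num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    proj = z @ directions                                    # (N, M)
    t = np.linspace(-t_max, t_max, num_knots)                # (T,)
    # Empirical characteristic function of each projection.
    ecf = np.exp(1j * proj[:, :, None] * t).mean(axis=0)     # (M, T)
    target_cf = np.exp(-0.5 * t**2)                          # CF of N(0, 1)
    weight = np.exp(-0.5 * t**2)                             # assumed window
    err = np.abs(ecf - target_cf) ** 2 * weight              # (M, T)
    # Trapezoidal rule over the frequency grid, one integral per direction.
    dt = t[1] - t[0]
    integral = (err[:, :-1] + err[:, 1:]).sum(axis=1) * dt / 2.0
    return n * integral.mean()

rng = np.random.default_rng(1)
gaussian_z = rng.standard_normal((1000, 16))   # already isotropic Gaussian
collapsed_z = np.zeros((1000, 16))             # fully collapsed embeddings
stat_gaussian = sigreg_sketch(gaussian_z, seed=0)
stat_collapsed = sigreg_sketch(collapsed_z, seed=0)
print(stat_gaussian, stat_collapsed)
```

In this sketch the statistic stays small for embeddings that are already isotropic Gaussian and becomes large for collapsed embeddings, which is the behavior a regularizer of this kind needs in order to penalize collapse.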

Paper Structure

This paper contains 58 sections, 16 theorems, 161 equations, 22 figures, 7 tables.

Key Result

lemma 1

Anisotropy amplifies bias. Whenever $\lambda_K>\lambda_1$, there always exists a downstream task (${\bm{y}}$) for which ${\bm{Z}}_{\rm aniso}$ produces a higher-bias estimator than ${\bm{Z}}_{\rm iso}$ for $\lambda>0$. (Proof in \ref{proof:linear_probe_bias}.)
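
The lemma can be illustrated numerically. The sketch below assumes a ridge-regression linear probe with penalty $\lambda$ (`lam`), noise-free targets ${\bm{y}}={\bm{Z}}\beta$, and isotropic and anisotropic embeddings with equal total variance. These choices, the helper `ridge_bias_norm`, and the specific eigenvalue scales are assumptions made for illustration, so the paper's exact setup may differ. The task is chosen to align with the lowest-variance direction of the anisotropic embeddings.

```python
import numpy as np

def ridge_bias_norm(z, beta, lam):
    """Norm of the ridge estimator's bias when targets are y = z @ beta."""
    d = z.shape[1]
    gram = z.T @ z
    # Expected ridge estimate under zero-mean noise: (Z'Z + lam I)^{-1} Z'Z beta.
    beta_hat = np.linalg.solve(gram + lam * np.eye(d), gram @ beta)
    return np.linalg.norm(beta_hat - beta)

rng = np.random.default_rng(0)
n, d, lam = 500, 4, 10.0
# Orthonormal columns make the Gram matrix exactly diagonal after scaling.
q, _ = np.linalg.qr(rng.standard_normal((n, d)))
scales_iso = np.ones(d)                           # equal variances, trace 4
scales_aniso = np.array([0.1, 0.5, 1.3, 2.1])     # unequal variances, trace 4
z_iso = q * np.sqrt(n * scales_iso)
z_aniso = q * np.sqrt(n * scales_aniso)

# Downstream task aligned with the lowest-variance direction of z_aniso.
beta = np.zeros(d)
beta[0] = 1.0
bias_iso = ridge_bias_norm(z_iso, beta, lam)
bias_aniso = ridge_bias_norm(z_aniso, beta, lam)
print(bias_iso, bias_aniso)
```

With a diagonal Gram matrix, the bias along a direction with Gram eigenvalue $s$ is $\lambda/(s+\lambda)$, so shrinking one eigenvalue while keeping the trace fixed increases the bias for a task aligned with that direction.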

Figures (22)

  • Figure 1: LeJEPA overview. Top-left: Training loss exhibits strong correlation with downstream linear probe performance on ImageNet-1k (ViT-base), providing the first practical loss for model selection without supervised probing. Top-right: Training remains stable without heuristics, even on 1.8B ViT-g models, with a stable training loss. Bottom-left: PCA features from an ImageNet-1k pretrained LeJEPA ViT-Large demonstrate clear semantic relationships. Bottom-right: Galaxy10 results showing that LeJEPA's in-domain pretraining consistently outperforms transfer learning from state-of-the-art frontier foundation models (DINOv2/v3 trained on natural images) across data regimes from 1-shot to full supervision. This demonstrates that domain-specific SSL beats generic transfer learning, even against massive-scale frontier models, when the framework scales effortlessly to any domain, model, and data scale.
  • Figure 2: Sketched Isotropic Gaussian Regularization (SIGReg): Given some arbitrary input data with density $p_{x}$ with support that may or may not lie on a manifold (left), a deep network (DN) encoder ($f_{{\bm{\theta}}}$) produces embeddings ${\bm{z}}=f_{{\bm{\theta}}}({\bm{x}})$ with some distribution ${\bm{z}} \sim p_{z}$ (middle). Our proposed Backward Cramér-Wold Statistics (\ref{sec:bcs}) objective pushes $p_z$ to match a target distribution $p_t$ by projecting the embeddings along $1d$ directions (middle, arrows) and enforcing that the univariate densities (right, colored lines) match the distribution of $p_t$, projected along the same directions. Any popular statistical test (provided in \ref{sec:tests}) can assess the goodness-of-fit; in practice we argue for characteristic function tests (\ref{sec:CF_better}). By using SIGReg with $p_t$ isotropic Gaussian (right, black lines), we introduce a lean and provably optimal (\ref{sec:gaussian}) JEPA, coined LeJEPA, free of numerous heuristics and able to produce competitive performances (\ref{sec:lejepa,sec:experiments}).
  • Figure 3: Illustration of \ref{thm:linear_probe_variance} showcasing how anisotropic (right) embeddings lead to a higher-variance estimator compared to isotropic embeddings (left). We sample $100$ training points for the $2$-class classification task and fit a logistic regression, repeating the process over numerous training set samples. Each sampling results in a decision boundary (purple).
  • Figure 4: Examples of distributions living on the surface of the sphere with varying Sobolev smoothness coefficients $\alpha$. As per \ref{thm:spherical_bounds}, the greater $\alpha$ is, the more global the impact of SIGReg will be for a given number of directions $M$. Practically, this represents the distribution of the encoder's output. Because the target density (isotropic Gaussian) is smooth, the $\alpha$ coefficients of the embedding will quickly grow, thereby making SIGReg (\ref{def:bcs}) immune to the curse of dimensionality.
  • Figure 5: Constructed data density with "X" distribution whose marginals are standard Gaussian and whose covariance is identity (left densities). Applying $M=10$ projections on the half circle directions produces $10$ univariate distributions that can be compared against a standard Gaussian (left) using any preferred statistic from \ref{sec:tests}. The appropriate direction is able to capture the degenerate distribution of the data, thereby creating a spike in the statistic value.
  • ...and 17 more figures
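
The construction described in the Figure 5 caption can be reproduced in a few lines. The sketch below is an assumption-laden illustration, not the paper's code: it builds an "X"-shaped density by setting the second coordinate to the first with a random sign flip, which is one simple way to obtain standard Gaussian marginals with identity covariance. It then projects the data onto $M=10$ directions on the half circle. Excess kurtosis serves as the per-direction statistic only because it is easy to check by hand; the Figure 2 caption notes that the authors argue for characteristic function tests in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)
sign = rng.choice([-1.0, 1.0], size=n)
# "X"-shaped data: each point lies on one of the two diagonals y = x or y = -x.
points = np.stack([x, sign * x], axis=1)            # (n, 2)

# M = 10 directions evenly spaced on the half circle.
angles = np.linspace(0.0, np.pi, 10, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (10, 2)
proj = points @ directions.T                        # (n, 10)

def excess_kurtosis(u):
    """Excess kurtosis of each column (close to 0 for Gaussian data)."""
    centered = u - u.mean(axis=0)
    return (centered**4).mean(axis=0) / (centered**2).mean(axis=0) ** 2 - 3.0

cov = np.cov(points, rowvar=False)   # close to the 2x2 identity
stats = excess_kurtosis(proj)        # one value per direction
print(np.round(cov, 3))
print(np.round(stats, 2))
```

Under this construction the excess kurtosis along a direction at angle $\theta$ is $3\sin^2(2\theta)$. It is therefore near zero along the coordinate axes, where the marginals are Gaussian, and spikes for directions close to the diagonals, even though the covariance alone cannot distinguish this data from an isotropic Gaussian.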

Theorems & Definitions (33)

  • definition 1
  • lemma 1: label=thm:linear_probe_bias
  • lemma 2: label=thm:linear_probe_variance
  • theorem 1: label=thm:nonlinear_optimal
  • lemma 3: label=thm:spherical_cramer
  • theorem 2: label=thm:bcs
  • definition 2: label=def:bcs
  • theorem 3: label=thm:moment_conendrum
  • theorem 4: label=thm:ecf_stability
  • theorem 5: label=thm:spherical_bounds
  • ...and 23 more