Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density

Randall Balestriero, Nicolas Ballas, Mike Rabbat, Yann LeCun

TL;DR

The paper analyzes Joint Embedding Predictive Architectures (JEPAs) and shows that forcing Gaussian embeddings during JEPA training inherently teaches the data distribution $p_X$ without reconstructing inputs. It derives JEPA-SCORE, a practical density estimator extracted from the encoder's Jacobian, and links the Gaussian embedding condition to an underlying energy function that encodes $p_X$ up to level-set reparameterizations, formalized through a change-of-variables relation. Specifically, JEPA-SCORE(x) = $\sum_{k=1}^{\operatorname{rank}(J_f(x))} \log(\sigma_k(J_f(x)))$, and the latent-density for generators satisfies $p_\mu(\mu) \propto \mathbb{E}_{p_T}\left[1/\prod_k \sigma_k(J_f(\mu,T))\right]^{-1}$, enabling direct density estimation from trained models. Empirical validation across synthetic data and state-of-the-art backbones (e.g., I-JEPA, DINOv2, MetaCLIP) demonstrates JEPA-SCORE's ability to reflect true data density and supports applications in outlier detection and data curation, highlighting a new bridge between JEPA-based representations and score-based density estimation.

Abstract

Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample's representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered an obvious remedy to representation collapse, we uncover that JEPAs' anti-collapse term does much more: it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic of the dataset and architecture used: in any case one can compute the learned probabilities of sample $x$ efficiently and in closed form using the model's Jacobian matrix at $x$. Our findings are empirically validated across datasets (synthetic, controlled, and Imagenet), across different self-supervised learning methods falling under the JEPA family (I-JEPA and DINOv2), and on multimodal models such as MetaCLIP. We denote the method extracting the JEPA learned density as JEPA-SCORE.
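As a rough illustration of the closed-form computation described above, the sketch below evaluates JEPA-SCORE(x) as the sum of the log singular values of the encoder Jacobian $J_f(x)$, taken over its numerical rank. It is a minimal sketch under stated assumptions, not the authors' implementation: `linear_encoder` and `toy_encoder` are hypothetical stand-ins for a trained backbone, the Jacobian is approximated by central finite differences rather than automatic differentiation, and the rank tolerance `tol` is an arbitrary choice.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian of f at x (rows: outputs, columns: inputs)."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)

def jepa_score(f, x, tol=1e-8):
    """Sum of log singular values of the Jacobian, restricted to its numerical rank."""
    J = numerical_jacobian(f, x)
    sigma = np.linalg.svd(J, compute_uv=False)
    # Keep only singular values above a relative tolerance (assumed rank criterion).
    sigma = sigma[sigma > tol * sigma.max()]
    return float(np.sum(np.log(sigma)))

# Hypothetical encoders mapping R^2 -> R^3; not taken from the paper.
A = np.array([[2.0, 0.0], [0.0, 3.0], [0.0, 0.0]])

def linear_encoder(x):
    return A @ x

def toy_encoder(x):
    return np.tanh(A @ x)

score_linear = jepa_score(linear_encoder, np.array([0.5, -1.0]))
score_origin = jepa_score(toy_encoder, np.zeros(2))
score_far = jepa_score(toy_encoder, np.array([2.0, 2.0]))
print(score_linear, score_origin, score_far)
```

For the linear map the singular values are 2 and 3 at every input, so the score is constant. For the saturating toy encoder, the Jacobian contracts far from the origin, so those inputs receive a lower score. For a real backbone one would typically obtain the Jacobian from an automatic-differentiation framework instead of finite differences.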

Paper Structure

This paper contains 11 sections, 3 theorems, 9 equations, 7 figures.

Key Result

Lemma 1

As $K$ grows, $X$ quickly concentrates around the hypersphere of radius $1$, converging to a Uniform density over the hypersphere surface. (The proof is given in the paper.)
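The concentration stated in Lemma 1 can be checked numerically. The summary does not say how $X$ is distributed, so the sketch below assumes $X \sim \mathcal{N}(0, I_K / K)$, an isotropic Gaussian scaled so that its expected squared norm is 1. Under that assumption, the spread of $\|X\|$ around 1 should shrink as $K$ grows. The function name `norm_stats` and the sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_stats(K, n=2000):
    """Mean and standard deviation of ||X|| for X ~ N(0, I_K / K) (assumed scaling)."""
    X = rng.standard_normal((n, K)) / np.sqrt(K)
    norms = np.linalg.norm(X, axis=1)
    return float(norms.mean()), float(norms.std())

# Compare a low, medium, and high embedding dimension.
stats = {K: norm_stats(K) for K in (2, 32, 512)}
for K, (mean_norm, std_norm) in stats.items():
    print(f"K={K:4d}  mean norm={mean_norm:.3f}  std={std_norm:.3f}")
```

Because an isotropic Gaussian is rotationally invariant, the direction $X / \|X\|$ is uniform on the sphere. Once the norm concentrates, the samples are therefore close to uniformly distributed over the hypersphere surface.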

Figures (7)

  • Figure 1: Depiction of the 5 least (left) and 5 most (right) likely samples of class 21 from Imagenet as per JEPA-SCORE, the implicit density estimator that JEPAs learn during pretraining. Two striking observations: (i) across all JEPAs (rows), the types of samples with low and high probabilities are alike, and (ii) the same samples (amongst 1,000) are found at those extrema. Random samples from that class are provided in Figure 4.
  • Figure 2: Top left: Visual illustration of JEPA-SCORE: the deep network (DN) $f_{{\bm{\theta}}}$ must learn $p_{X}$ for its Jacobian matrix to expand or contract the density in order to produce a Uniform density on the hypersphere surface in its embedding space (see the paper's uniform-density and general-density results). Top right: Pearson correlation between JEPA-SCORE and the true $\log p(x)$ on a GMM data model for various input dimensions (rows) and numbers of samples (columns). In all cases, producing Gaussian embeddings makes the backbone $f_{{\bm{\theta}}}$ internalize the data density, which can be easily extracted using the proposed JEPA-SCORE. Bottom: as JEPA-SCORE is an approximation of the true score function, it is possible to perform Langevin sampling to recover the true data distribution, as shown here in two dimensions.
  • Figure 3: Depiction of JEPA-SCORE for 5,000 samples from different datasets (imagenet-1k/a/r, MNIST and Galaxy). We clearly observe that as the pretraining dataset size increases (all models compared against IJEPA-1k), MNIST and Galaxy images are seen as lower-probability samples, i.e., those images are less and less represented within the overall pretraining dataset. This can be used to assess whether a model is ready to handle particular data domains at test time for zero-shot tasks. While the score does not rely on singular vectors, some examples are provided in the paper's singular-vectors figure.
  • Figure 4: Random samples from Imagenet-1k training dataset for class 21.
  • Figure 5: Top: least and most likely samples based on the proposed JEPA-SCORE for a few different pretrained backbones on class 141. Bottom: Random samples from Imagenet-1k training dataset for class 141.
  • ...and 2 more figures
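
The Figure 2 caption mentions Langevin sampling driven by an approximate score function. The following is a minimal sketch of unadjusted Langevin dynamics in two dimensions, assuming an analytically known score in place of a learned one: `gaussian_score` is the exact gradient of the log density of a standard Gaussian, used here only as a stand-in. The step size, number of steps, and starting point are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def langevin_sample(score_fn, x0, step=0.01, n_steps=1000, seed=0):
    """Unadjusted Langevin dynamics: x <- x + step * score(x) + sqrt(2 * step) * noise."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * score_fn(x) + np.sqrt(2.0 * step) * noise
    return x

def gaussian_score(x):
    """Gradient of log N(0, I) with respect to x (stand-in for a learned score)."""
    return -x

# Start 2,000 independent chains far from the mode and let them mix.
x0 = np.full((2000, 2), 5.0)
samples = langevin_sample(gaussian_score, x0)
print(samples.mean(axis=0), samples.var(axis=0))
```

With a small step size the chains forget their starting point and the samples approximately follow the target density. Replacing `gaussian_score` with the gradient of an estimated log density is what would turn this into sampling from a learned model.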

Theorems & Definitions (6)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • proof
  • proof
  • proof