Denoising with a Joint-Embedding Predictive Architecture

Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu

TL;DR

D-JEPA presents a unified framework that integrates joint-embedding predictive architectures with diffusion-style token modeling to enable high-quality, autoregressive generation of continuous data. By recasting masked image modeling as generalized next-token prediction and coupling a diffusion or flow-matching objective at the token level, D-JEPA achieves state-of-the-art image synthesis on ImageNet across multiple model scales, while also demonstrating potential for video, audio, and multimodal generation. The method relies on three ViT backbones (context, target, predictor) with EMA updates to the target encoder and a two-loss objective L = L_d + L_p to prevent collapse and inject semantic structure, respectively. Empirically, scaling improves both generation fidelity and sampling efficiency, with large models requiring fewer autoregressive steps and producing fast, high-quality samples; qualitative results show realistic portraits and complex scenes, and extended experiments indicate broad applicability beyond images. The work suggests a path toward unified, multi-modal generative systems for continuous data, while acknowledging optimization bottlenecks in the lightweight denoising MLP and attention-based inefficiencies that warrant further research.
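
The EMA update of the target encoder and the summed objective mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the flat lists standing in for encoder weights, and the momentum value 0.999 are all hypothetical. The source states only that the target encoder is updated by EMA and that L = L_d + L_p.

```python
def ema_update(target_params, context_params, momentum=0.999):
    """Move each target-encoder weight a small step toward the context encoder.

    `momentum` is an assumed value; the source does not give one here.
    """
    return [
        momentum * t + (1.0 - momentum) * c
        for t, c in zip(target_params, context_params)
    ]


def total_loss(diffusion_loss, prediction_loss):
    """Combine the two terms as an unweighted sum: L = L_d + L_p."""
    return diffusion_loss + prediction_loss


# Toy usage: two scalar "weights" per encoder.
target = [1.0, 0.0]
context = [0.0, 1.0]
target = ema_update(target, context, momentum=0.9)
loss = total_loss(diffusion_loss=0.25, prediction_loss=0.5)
```

Because the target encoder only receives this slow moving average and no gradients, it changes more smoothly than the context encoder, which is the usual motivation for EMA targets in joint-embedding methods.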

Abstract

Joint-embedding predictive architectures (JEPAs) have shown substantial promise in self-supervised representation learning, yet their application in generative modeling remains underexplored. Conversely, diffusion models have demonstrated significant efficacy in modeling arbitrary probability distributions. In this paper, we introduce Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), pioneering the integration of JEPA within generative modeling. By recognizing JEPA as a form of masked image modeling, we reinterpret it as a generalized next-token prediction strategy, facilitating data generation in an auto-regressive manner. Furthermore, we incorporate diffusion loss to model the per-token probability distribution, enabling data generation in a continuous space. We also adapt flow matching loss as an alternative to diffusion loss, thereby enhancing the flexibility of D-JEPA. Empirically, with increased GFLOPs, D-JEPA consistently achieves lower FID scores with fewer training epochs, indicating its good scalability. Our base, large, and huge models outperform all previous generative models across all scales on ImageNet conditional generation benchmarks. Beyond image generation, D-JEPA is well-suited for other continuous data modeling, including video and audio.
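
To make the per-token diffusion loss more concrete, here is a hedged sketch for a single token. It assumes the standard noise-prediction form (add Gaussian noise to the token, then regress the noise given the noised token and a conditioning feature z). The names `noised_token`, `diffusion_loss`, `denoiser`, and the use of `alpha_bar` as the noise-level input are assumptions made for illustration; the abstract says only that a diffusion loss models the per-token distribution.

```python
import math
import random


def noised_token(x, eps, alpha_bar):
    """Forward noising: x_t = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * xi + b * ei for xi, ei in zip(x, eps)]


def diffusion_loss(x, z, denoiser, alpha_bar, rng):
    """Mean squared error between the sampled noise and the predicted noise.

    `denoiser(x_t, alpha_bar, z)` is a hypothetical callable standing in for
    the small per-token denoising network conditioned on the feature z.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_t = noised_token(x, eps, alpha_bar)
    eps_hat = denoiser(x_t, alpha_bar, z)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x)


# Toy usage with a placeholder denoiser that always predicts zero noise.
rng = random.Random(0)
token = [0.5, -1.0, 2.0]
feature = [0.1, 0.2]  # stands in for the predicted feature z_i
loss_value = diffusion_loss(token, feature, lambda x_t, ab, z: [0.0] * len(x_t), 0.5, rng)
```

In training, the noise level would be sampled at random for every token, and the loss would be averaged over all masked tokens; both details are omitted here to keep the sketch short.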

Paper Structure (58 sections, 13 equations, 17 figures, 12 tables, 1 algorithm)

Figures (17)

  • Figure 1: Data flow during D-JEPA training. Initially, the training data is divided into non-overlapping semantic tokens, which can be either in the raw space or in the latent space obtained after VAE encoding. A random subset of these input tokens is then masked. The feature predictor $\gamma$ is employed to predict features for these masked tokens, utilizing the unmasked tokens as contextual information. Each masked token is concurrently subjected to a diffusion loss (or flow matching loss) to learn the distribution of each token $p(x_i|z_i)$, independently. Additionally, a prediction loss is applied, compelling each masked token to regress towards the target tokens $g_i$.
  • Figure 2:
  • Figure 3: D-JEPA achieves state-of-the-art image quality. We showcase selected high-fidelity examples of class-conditional generation on ImageNet $256\times256$ using D-JEPA-H.
  • Figure 4: Grid search over CFG scale and temperature $\tau$. FID=1.199 and IS=302.69 are the scores achievable by the VAE, and therefore the theoretical best that D-JEPA can reach (lowest FID, highest IS).
  • Figure 5: Uncurated $256 \times 256$ D-JEPA-H samples with different auto-regressive steps. $\text{cfg}=3.9, \tau=0.99$.
  • ...and 12 more figures
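
The random masking described in the Figure 1 caption, and the multi-step auto-regressive generation referred to in the Figure 5 caption, can be illustrated with a small index-level sketch. Everything here is an assumption made for illustration: the function names, the mask ratio, and the choice of equally sized groups per step are hypothetical, and the actual schedule used by D-JEPA may differ.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Split token indices into (context, masked) index lists for training."""
    order = list(range(num_tokens))
    rng.shuffle(order)
    n_masked = min(num_tokens, max(1, int(round(mask_ratio * num_tokens))))
    return sorted(order[n_masked:]), sorted(order[:n_masked])


def generation_schedule(num_tokens, num_steps, rng):
    """Partition a random token order into `num_steps` groups.

    At each auto-regressive step, one group of tokens would be predicted
    from all previously generated tokens ("generalized next-token prediction").
    """
    order = list(range(num_tokens))
    rng.shuffle(order)
    base, extra = divmod(num_tokens, num_steps)
    groups, start = [], 0
    for step in range(num_steps):
        size = base + (1 if step < extra else 0)
        groups.append(order[start:start + size])
        start += size
    return groups


# Toy usage: a 16 x 16 token grid generated in 64 steps.
rng = random.Random(0)
context_idx, masked_idx = random_mask(num_tokens=256, mask_ratio=0.75, rng=rng)
groups = generation_schedule(num_tokens=256, num_steps=64, rng=rng)
```

Fewer steps mean larger groups per step, which is the trade-off between sampling speed and quality that the step-count comparison in Figure 5 explores.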