Denoising with a Joint-Embedding Predictive Architecture
Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu
TL;DR
D-JEPA presents a unified framework that integrates joint-embedding predictive architectures with diffusion-style token modeling to enable high-quality, autoregressive generation of continuous data. By recasting masked image modeling as generalized next-token prediction and coupling a diffusion or flow-matching objective at the token level, D-JEPA achieves state-of-the-art image synthesis on ImageNet across multiple model scales, while also demonstrating potential for video, audio, and multimodal generation. The method relies on three ViT backbones (context, target, predictor) with EMA updates to the target encoder and a two-loss objective L = L_d + L_p to prevent collapse and inject semantic structure, respectively. Empirically, scaling improves both generation fidelity and sampling efficiency, with large models requiring fewer autoregressive steps and producing fast, high-quality samples; qualitative results show realistic portraits and complex scenes, and extended experiments indicate broad applicability beyond images. The work suggests a path toward unified, multi-modal generative systems for continuous data, while acknowledging optimization bottlenecks in the lightweight denoising MLP and attention-based inefficiencies that warrant further research.
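The EMA update of the target encoder and the two-term objective L = L_d + L_p described above can be illustrated with a minimal sketch. It assumes parameters are flat lists of floats; the function names and the momentum value 0.999 are hypothetical choices for illustration and are not taken from the paper.

```python
def ema_update(target_params, context_params, momentum=0.999):
    """Return new target-encoder parameters as an exponential moving
    average of the context-encoder parameters.

    The momentum default is an illustrative assumption, not a value
    reported by the authors.
    """
    return [
        momentum * t + (1.0 - momentum) * c
        for t, c in zip(target_params, context_params)
    ]


def total_loss(diffusion_loss, prediction_loss):
    """Combine the two terms of the objective: L = L_d + L_p."""
    return diffusion_loss + prediction_loss


# One illustrative step: the target encoder drifts slowly toward the
# context encoder, and the training loss is the plain sum of both terms.
target = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
loss = total_loss(diffusion_loss=0.25, prediction_loss=0.5)
```

In a real training loop only the context encoder and predictor would receive gradients, while the target encoder is changed solely through an update of this kind.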
Abstract
Joint-embedding predictive architectures (JEPAs) have shown substantial promise in self-supervised representation learning, yet their application in generative modeling remains underexplored. Conversely, diffusion models have demonstrated significant efficacy in modeling arbitrary probability distributions. In this paper, we introduce Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), pioneering the integration of JEPA within generative modeling. By recognizing JEPA as a form of masked image modeling, we reinterpret it as a generalized next-token prediction strategy, facilitating data generation in an auto-regressive manner. Furthermore, we incorporate diffusion loss to model the per-token probability distribution, enabling data generation in a continuous space. We also adapt flow matching loss as an alternative to diffusion loss, thereby enhancing the flexibility of D-JEPA. Empirically, with increased GFLOPs, D-JEPA consistently achieves lower FID scores with fewer training epochs, indicating its good scalability. Our base, large, and huge models outperform all previous generative models across all scales on ImageNet conditional generation benchmarks. Beyond image generation, D-JEPA is well-suited for other continuous data modeling, including video and audio.
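As a rough illustration of the flow matching alternative mentioned above, the sketch below builds a per-token training pair using the common linear-interpolation formulation: a noise sample x0 and a data token x1 are mixed as x_t = (1 - t) * x0 + t * x1, and the regression target is the velocity x1 - x0. This is an assumption about the general technique; the exact parameterization, and the way the denoising network is conditioned on the predictor output, may differ in D-JEPA. The helper names are hypothetical.

```python
import random


def flow_matching_pair(x1, t, rng):
    """Build a flow matching training pair for one token.

    x1  : data token as a list of floats
    t   : time in [0, 1]
    rng : random.Random instance used to draw the Gaussian noise x0

    Returns (x_t, velocity) where x_t is the interpolated point and
    velocity = x1 - x0 is the regression target.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    x_t = [(1.0 - t) * n + t * d for n, d in zip(x0, x1)]
    velocity = [d - n for n, d in zip(x0, x1)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(0)
token = [1.0, 2.0]
x_t, v_target = flow_matching_pair(token, t=0.3, rng=rng)
# A trained network would predict v from (x_t, t, conditioning);
# a zero prediction stands in for it here.
loss = mse([0.0, 0.0], v_target)
```

At t = 1 the interpolated point equals the data token, and at t = 0 it equals the noise sample, so following the velocity from t = 0 to t = 1 carries noise to data.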
