JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation

Siheng Wan, Zhengtao Yao, Zhengdao Li, Junhao Dong, Yanshu Li, Yikai Li, Linshan Li, Haoyan Xu, Yijiang Li, Zhikang Dong, Huacan Wang, Jifeng Shen

TL;DR

JEPA-T tackles efficient text-to-image generation by fusing text with visual tokens in a unified Joint-Embedding Predictive Architecture. It tokenizes images with a VAE and text with a CLIP encoder, processes both in a shared Transformer, and uses two text-conditioning channels (input-level injection and post-predictor cross-attention) alongside objective-level alignment before flow matching. The model is trained with a masked prediction objective plus a conditional flow-matching loss, enabling open-vocabulary generation without pixel-level reconstruction. Results on ImageNet-1K demonstrate strong data efficiency and improved alignment over non-fusion and late-fusion baselines, validating the viability of staged fusion within a task-agnostic backbone for scalable multimodal generation.

Abstract

Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, which are processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow-matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is available at https://github.com/justin-herry/JEPA-T.git
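
To make the flow-matching and iterative-denoising ideas in the abstract concrete, here is a minimal sketch in plain Python. It assumes the common linear (rectified-flow style) path between Gaussian noise and data, which may differ from the parametrization JEPA-T actually uses. The names `interpolate`, `flow_matching_loss`, `euler_sample`, and `predict_velocity` are hypothetical and do not come from the paper or its repository. A real model would operate on batched token tensors; flat lists of floats stand in for them here.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x1, cond, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity(x_t, t, cond) is a stand-in for the text-conditioned
    network (assumption: it returns one velocity value per token value).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]      # Gaussian noise endpoint
    t = rng.random()                            # time drawn uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    v_target = [b - a for a, b in zip(x0, x1)]  # constant velocity of the linear path
    v_pred = predict_velocity(x_t, t, cond)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x1)

def euler_sample(predict_velocity, x0, cond, num_steps):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = predict_velocity(x, k * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this sketch, training would minimize `flow_matching_loss` over many (tokens, text) pairs, and generation would call `euler_sample` starting from fresh noise with the text condition held fixed. The step count is a free parameter here, not a value reported in the source.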

Paper Structure

This paper contains 11 sections, 6 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Selected examples of class- and text-conditional generation on ImageNet 256×256 using JEPA-T with a flow-matching loss.
  • Figure 2: Overview of JEPA-T. An image is tokenized with a subset masked. Context tokens are encoded and combined with text embeddings in the predictor, while the EMA target encoder provides full-token supervision. Text is injected at two stages: (1) input-level injection into the predictor, biasing denoising dynamics with semantic intent, and (2) post-predictor cross-attention, refining visual tokens with high-resolution cues. Training is guided by a masked prediction loss and a conditional flow-matching loss.
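
Two mechanics named in the Figure 2 caption, masking a subset of image tokens and maintaining an EMA target encoder, can be illustrated with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the mask ratio, and the decay value are all hypothetical, and parameters are represented as flat lists of floats rather than network weights.

```python
import random

def split_context_and_masked(num_tokens, mask_ratio, rng):
    """Randomly pick token positions to mask; the rest form the visible context."""
    num_masked = int(round(num_tokens * mask_ratio))
    masked = sorted(rng.sample(range(num_tokens), num_masked))
    masked_set = set(masked)
    context = [i for i in range(num_tokens) if i not in masked_set]
    return context, masked

def ema_update(target_params, online_params, decay):
    """Move each target parameter toward the online parameter by a factor (1 - decay)."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]

# Hypothetical settings: 16 tokens, 75% masked, EMA decay of 0.99.
rng = random.Random(0)
context_idx, masked_idx = split_context_and_masked(16, 0.75, rng)
target = ema_update([1.0, 0.0], [0.0, 1.0], 0.99)
```

Under this reading, the context encoder would see only the tokens at `context_idx`, the predictor would be asked to recover representations at `masked_idx`, and the target encoder (updated by `ema_update` rather than by gradients) would supply the supervision from the full token set.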