JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation

Siheng Wan; Zhengtao Yao; Zhengdao Li; Junhao Dong; Yanshu Li; Yikai Li; Linshan Li; Haoyan Xu; Yijiang Li; Zhikang Dong; Huacan Wang; Jifeng Shen

TL;DR

JEPA-T tackles efficient text-to-image generation by fusing text with visual tokens in a unified Joint-Embedding Predictive Architecture. It tokenizes images with a VAE and text with a CLIP encoder, processes both in a shared Transformer, and uses two text-conditioning channels—input-level injection and post-predictor cross-attention—alongside objective-level alignment before flow matching. The model is trained with a masked prediction objective plus a conditional flow-matching loss, enabling open-vocabulary generation without pixel-level reconstruction. Results on ImageNet-1K demonstrate strong data efficiency and improved alignment over non-fusion and late-fusion baselines, validating the viability of staged fusion within a task-agnostic backbone for scalable multimodal generation.
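The training objective summarized above (masked prediction plus a conditional flow-matching loss) can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the straight-line interpolation path between noise and latent, the mask ratio, and the names `sample_mask`, `interpolate`, `flow_matching_loss`, and `predict_velocity` are all assumptions introduced here, since the summary does not specify them.

```python
import random

def sample_mask(num_tokens, mask_ratio, rng):
    """Pick which token positions are hidden from the context (True = masked)."""
    num_masked = max(1, round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]

def interpolate(noise, target, t):
    """Point on the straight path from noise (t=0) to the target latent (t=1).
    The linear path is an assumption; the source does not name the path used."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]

def flow_matching_loss(predict_velocity, latents, mask, cond, rng):
    """Mean squared velocity error, computed over masked tokens only.

    predict_velocity(x_t, t, cond) is a hypothetical stand-in for the
    text-conditioned denoising head; cond stands in for the text features.
    """
    total, count = 0.0, 0
    for token, is_masked in zip(latents, mask):
        if not is_masked:
            continue  # visible tokens serve as context, not as targets
        t = rng.random()
        noise = [rng.gauss(0.0, 1.0) for _ in token]
        x_t = interpolate(noise, token, t)
        target_v = [x - n for n, x in zip(noise, token)]  # velocity of the linear path
        pred_v = predict_velocity(x_t, t, cond)
        total += sum((p - v) ** 2 for p, v in zip(pred_v, target_v))
        count += len(token)
    return total / count
```

In this sketch the loss is restricted to masked positions, which mirrors the idea that the model predicts hidden visual tokens from visible ones rather than reconstructing pixels.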

Abstract

Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is available at https://github.com/justin-herry/JEPA-T.git
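To make the post-predictor cross-attention step concrete, the following is a minimal single-head sketch in which predicted visual tokens act as queries over text tokens. It is an assumption-laden illustration rather than the released code: learned query/key/value projections are omitted (treated as identity), visual and text tokens are assumed to share one dimension, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(visual_tokens, text_tokens):
    """Each visual token (query) attends over all text tokens (keys and values).

    Learned projections are left out to keep the sketch short; a real layer
    would apply separate linear maps to queries, keys, and values.
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in visual_tokens:
        weights = softmax([dot(q, k) * scale for k in text_tokens])
        context = [sum(w * v[d] for w, v in zip(weights, text_tokens))
                   for d in range(dim)]
        # Residual connection: text context is added to the visual token.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused
```

Because the text enters only through this added layer, the backbone that produced `visual_tokens` stays free of text-specific parameters, which is one way to read the abstract's claim of a task-agnostic backbone.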
