High-Resolution Image Synthesis via Next-Token Prediction

Dengsheng Chen, Jie Hu, Tiezhu Yue, Xiaoming Wei, Enhua Wu

TL;DR

This work presents D-JEPA·T2I, an autoregressive framework for high-resolution text-to-image synthesis that extends next-token prediction to 4K outputs. It combines a D-JEPA-based architecture with a multimodal visual transformer, a flow matching objective, and the novel Visual Rotary Positional Embedding (VoPE) to support continuous resolutions and arbitrary aspect ratios. A data-feedback training strategy, comprising Statistical Analysis Sampling and an online Critic Model, guides data selection and focuses learning on challenging cases, achieving state-of-the-art results among autoregressive methods on GenEval and T2I-CompBench benchmarks and strong human ratings. The approach enables efficient, flexible high-resolution generation and highlights avenues for future work in scaling, video generation, and unified multimodal modeling.

Abstract

Recently, autoregressive models have demonstrated remarkable performance in class-conditional image generation. However, the application of next-token prediction to high-resolution text-to-image generation remains largely unexplored. In this paper, we introduce D-JEPA·T2I, an autoregressive model based on continuous tokens that incorporates innovations in both architecture and training strategy to generate high-quality, photorealistic images at arbitrary resolutions, up to 4K. Architecturally, we adopt the denoising joint embedding predictive architecture (D-JEPA) while leveraging a multimodal visual transformer to effectively integrate textual and visual features. Additionally, we introduce flow matching loss alongside the proposed Visual Rotary Positional Embedding (VoPE) to enable continuous resolution learning. In terms of training strategy, we propose a data feedback mechanism that dynamically adjusts the sampling procedure based on statistical analysis and an online learning critic model. This encourages the model to move beyond its comfort zone, reducing redundant training on well-mastered scenarios and compelling it to address more challenging cases with suboptimal generation quality. For the first time, we achieve state-of-the-art high-resolution image synthesis via next-token prediction.

Paper Structure

This paper contains 52 sections, 8 equations, 23 figures, 4 tables, 1 algorithm.

Figures (23)

  • Figure 1: D-JEPA$\cdot$T2I can accurately generate high-fidelity, high-resolution images across various aspect ratios. Refer to the supplementary materials for 4K resolution samples and additional qualitative results.
  • Figure 2: Denoising with a Joint-Embedding Predictive Architecture for text-to-image synthesis. We employ T5-XXL [chung2024scaling] as the text encoder and the KL-VAE pretrained in [esser2024scaling] as the image encoder. For efficient training, textual and visual token sequences are truncated to at most 256 and $256^2$ tokens, respectively. The feature predictor $\gamma$, the context encoder $\phi$, and the target encoder $\bar{\phi}$ share the same network architecture, each consisting of several multimodal visual transformer blocks. Gradients are stopped at the output of the target encoder $\bar{\phi}$, so it is updated only via exponential moving average (EMA). Both the prediction loss $\mathcal{L}_{\text{pred}}$ and the flow matching loss $\mathcal{L}_{\text{flow}}$ are computed only for the masked visual tokens, following [chen2024denoising].
  • Figure 3: Comparison of decay curves between RoPE and VoPE.
  • Figure 4: Training procedure incorporating data feedback. The evaluation result will be used to train the critic model.
  • Figure 5: The pipeline to prepare training set for critic model.
  • ...and 18 more figures
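
The Figure 2 caption names three mechanisms that are easy to lose in prose: a target encoder updated only by EMA, losses restricted to masked visual tokens, and a flow matching objective. The following is a minimal, dependency-free sketch of each, not the authors' implementation. The function names, the decay value of 0.999, the linear interpolation path, and the convention that noise sits at t = 0 are all assumptions made for illustration; in the paper the encoders are transformer networks rather than plain lists of floats.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def ema_update(target: Vector, context: Vector, decay: float = 0.999) -> Vector:
    """Return new target weights as an EMA of the context weights.

    No gradient reaches the target encoder; this is its only update rule.
    The default decay is an assumed value, not taken from the paper.
    """
    return [decay * t + (1.0 - decay) * c for t, c in zip(target, context)]


def masked_mse(
    pred: Sequence[Vector], ref: Sequence[Vector], mask: Sequence[bool]
) -> float:
    """Mean squared error averaged over masked token positions only."""
    total, count = 0.0, 0
    for p, r, m in zip(pred, ref, mask):
        if not m:
            continue  # visible (context) tokens contribute no loss
        total += sum((pi - ri) ** 2 for pi, ri in zip(p, r))
        count += len(p)
    return total / count if count else 0.0


def flow_matching_target(
    noise: Vector, data: Vector, t: float
) -> Tuple[Vector, Vector]:
    """Assumed linear path: x_t = (1 - t) * noise + t * data.

    Returns the interpolated sample x_t and the constant velocity
    (data - noise) that a flow matching head would regress toward.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


# Three tokens of dimension 2; only the last two are masked.
mask = [False, True, True]
pred = [[9.0, 9.0], [1.0, 2.0], [0.0, 0.0]]
ref = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
loss = masked_mse(pred, ref, mask)  # the unmasked first token is ignored

x_t, v = flow_matching_target([0.0, 0.0], [2.0, 4.0], t=0.5)
new_target = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
```

In this sketch the large error on the first (visible) token does not affect `loss`, which mirrors the caption's statement that both losses are computed only for masked visual tokens. A flow matching loss would then be `masked_mse` applied to a predicted velocity and the `velocity` target.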