High-Resolution Image Synthesis via Next-Token Prediction
Dengsheng Chen, Jie Hu, Tiezhu Yue, Xiaoming Wei, Enhua Wu
TL;DR
This work presents D-JEPA·T2I, an autoregressive framework for high-resolution text-to-image synthesis that extends next-token prediction to 4K outputs. It combines a D-JEPA-based architecture with a multimodal visual transformer, a flow matching objective, and the novel Visual Rotary Positional Embedding (VoPE) to support continuous resolutions and arbitrary aspect ratios. A data-feedback training strategy, comprising Statistical Analysis Sampling and an online Critic Model, guides data selection and focuses learning on challenging cases. The model achieves state-of-the-art results among autoregressive methods on the GenEval and T2I-CompBench benchmarks, along with strong human ratings. The approach enables efficient, flexible high-resolution generation and points to future work in scaling, video generation, and unified multimodal modeling.
Abstract
Recently, autoregressive models have demonstrated remarkable performance in class-conditional image generation. However, the application of next-token prediction to high-resolution text-to-image generation remains largely unexplored. In this paper, we introduce D-JEPA·T2I, an autoregressive model based on continuous tokens that incorporates innovations in both architecture and training strategy to generate high-quality, photorealistic images at arbitrary resolutions, up to 4K. Architecturally, we adopt the denoising joint embedding predictive architecture (D-JEPA) while leveraging a multimodal visual transformer to effectively integrate textual and visual features. Additionally, we introduce a flow matching loss alongside the proposed Visual Rotary Positional Embedding (VoPE) to enable continuous resolution learning. In terms of training strategy, we propose a data feedback mechanism that dynamically adjusts the sampling procedure based on statistical analysis and an online-learning critic model. This encourages the model to move beyond its comfort zone, reducing redundant training on well-mastered scenarios and compelling it to address more challenging cases where generation quality is still suboptimal. For the first time, we achieve state-of-the-art high-resolution image synthesis via next-token prediction.
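
For readers unfamiliar with the flow matching loss mentioned in the abstract, the following is a minimal sketch of the generic linear-path formulation for a single continuous token. It is not taken from the paper: the function name `flow_matching_loss`, the `predict_velocity` callable, and the convention that noise sits at t = 0 and data at t = 1 are all assumptions made for illustration, and the actual objective used by D-JEPA·T2I may differ in conditioning, time sampling, and weighting.

```python
import random


def flow_matching_loss(predict_velocity, x1, rng):
    """Generic linear-path flow matching loss for one continuous token.

    Hypothetical sketch, not the paper's implementation.
    predict_velocity: callable (xt, t) -> list of floats, same length as x1.
    x1: target token as a list of floats (the "data" end of the path).
    rng: a random.Random instance, passed in for reproducibility.
    """
    # Assumed convention: Gaussian noise at t = 0, data at t = 1.
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    t = rng.random()  # uniform in [0, 1)

    # Point on the straight line between noise and data.
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

    # Along a straight path the velocity is constant: x1 - x0.
    target = [b - a for a, b in zip(x0, x1)]

    pred = predict_velocity(xt, t)
    # Mean squared error between predicted and target velocity.
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)


if __name__ == "__main__":
    token = [0.5, -1.0, 2.0]
    zero_model = lambda xt, t: [0.0] * len(xt)  # untrained stand-in
    print(flow_matching_loss(zero_model, token, random.Random(0)))
```

In an autoregressive setting, `predict_velocity` would additionally be conditioned on the features predicted for the current token, which is omitted here to keep the sketch self-contained.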
