Enhancing Image Generation Fidelity via Progressive Prompts
Zhen Xiong, Yuqi Li, Chuanguang Yang, Tiao Tan, Zhihong Zhu, Siyuan Li, Yue Ma
TL;DR
This work tackles the limited regional prompt control in diffusion-transformer–based text-to-image generation. It introduces DiTPipe, a coarse-to-fine pipeline that uses an LLM to generate high-level content and low-level details, and a Controllable Region-Attention mechanism to inject region-specific prompts at different depths of the DiT backbone. Key components include region mask division, progressive prompting, T5-CLIP text-state fusion, and region-aware cross-attention that operates across multiple DiT blocks. Empirical results on 1024×1024 images show improved fidelity and region-specific control over strong baselines such as SDXL and SD-1.5, with ablations confirming the importance of depth-aware attention and the number of region-attention modules. The approach advances controllability in DiT-based generation and opens avenues for multi-modal extensions in the future.
Abstract
The diffusion transformer (DiT) architecture has attracted significant attention in image generation, achieving better fidelity, performance, and diversity. However, most existing DiT-based image generation methods focus on global-aware synthesis, and regional prompt control has been less explored. In this paper, we propose a coarse-to-fine generation pipeline for regional prompt-following generation. Specifically, we first use a large language model (LLM) to generate both high-level descriptions of the image (such as content, topic, and objects) and low-level descriptions (such as details and style). Then, we explore the influence of cross-attention layers at different depths. We find that deeper layers are always responsible for high-level content control, while shallow layers handle low-level content control. The different prompts are injected into the proposed regional cross-attention control for coarse-to-fine generation. With the proposed pipeline, we enhance the controllability of DiT-based image generation. Extensive quantitative and qualitative results show that our pipeline improves the quality of the generated images.
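To make the idea of depth-dependent regional prompting concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each region is given as a boolean mask over image tokens, that learned query/key/value projections can be omitted for brevity, and that blocks are split into "shallow" and "deep" halves by a hypothetical `split` fraction. All function names and parameters here are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, text_tokens):
    # image_tokens: (N, d) queries; text_tokens: (T, d) keys and values.
    # Learned projections are omitted (an assumption made for brevity).
    d = image_tokens.shape[-1]
    scores = image_tokens @ text_tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ text_tokens


def regional_cross_attention(image_tokens, region_masks, region_texts):
    # Each image token attends only to the prompt of the region it lies in.
    # region_masks: list of boolean arrays of shape (N,)
    # region_texts: list of arrays of shape (T_r, d), one per region
    out = np.zeros_like(image_tokens)
    for mask, text in zip(region_masks, region_texts):
        out[mask] = cross_attention(image_tokens[mask], text)
    return out


def select_prompts(block_index, num_blocks, high_level, low_level, split=0.5):
    # Hypothetical rule following the abstract's observation: deeper blocks
    # receive high-level prompts, shallower blocks receive low-level prompts.
    if block_index >= int(split * num_blocks):
        return high_level
    return low_level
```

In this sketch, a generation loop would call `select_prompts` once per block to choose which set of regional text embeddings to pass to `regional_cross_attention`. Tokens not covered by any mask receive a zero update, which is a simplification; a real pipeline would likely fall back to a global prompt for such tokens.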
