Enhancing Image Generation Fidelity via Progressive Prompts

Zhen Xiong, Yuqi Li, Chuanguang Yang, Tiao Tan, Zhihong Zhu, Siyuan Li, Yue Ma

TL;DR

This work tackles the limited regional prompt control in diffusion-transformer–based text-to-image generation. It introduces DiTPipe, a coarse-to-fine pipeline that uses an LLM to generate high-level content and low-level details, and a Controllable Region-Attention mechanism to inject region-specific prompts at different depths of the DiT backbone. Key components include region mask division, progressive prompting, T5-CLIP text-state fusion, and region-aware cross-attention that operates across multiple DiT blocks. Empirical results on 1024×1024 images show improved fidelity and region-specific control over strong baselines such as SDXL and SD-1.5, with ablations confirming the importance of depth-aware attention and the number of region-attention modules. The approach advances controllability in DiT-based generation and opens avenues for multi-modal extensions in the future.

Abstract

The diffusion transformer (DiT) architecture has attracted significant attention in image generation, achieving better fidelity, performance, and diversity. However, most existing DiT-based image generation methods focus on global-aware synthesis, and regional prompt control has been less explored. In this paper, we propose a coarse-to-fine generation pipeline for regional prompt-following generation. Specifically, we first utilize the powerful large language model (LLM) to generate both high-level descriptions of the image (such as content, topic, and objects) and low-level descriptions (such as details and style). Then, we explore the influence of cross-attention layers at different depths. We find that deeper layers are always responsible for high-level content control, while shallow layers handle low-level content control. Various prompts are injected into the proposed regional cross-attention control for coarse-to-fine generation. By using the proposed pipeline, we enhance the controllability of DiT-based image generation. Extensive quantitative and qualitative results show that our pipeline can improve the performance of the generated images.
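
The depth-dependent prompt injection described in the abstract can be illustrated with a minimal sketch. It follows the abstract's statement that deeper blocks handle high-level content and shallow blocks handle low-level content. The names `ProgressivePrompt`, `prompt_for_block`, and the `split_ratio` parameter are hypothetical and do not come from the paper, and the single hard boundary between shallow and deep blocks is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class ProgressivePrompt:
    """A pair of LLM-generated descriptions for one image or region."""
    high_level: str  # content, topic, objects
    low_level: str   # details, style


def prompt_for_block(prompt, block_index, num_blocks, split_ratio=0.5):
    """Pick the description a block at the given depth would receive.

    Assumption: blocks at or beyond `split_ratio` of the depth count as
    "deep" and get the high-level description; earlier blocks count as
    "shallow" and get the low-level one. The paper may split differently.
    """
    if not 0 <= block_index < num_blocks:
        raise ValueError("block_index must lie in [0, num_blocks)")
    boundary = int(num_blocks * split_ratio)
    if block_index >= boundary:
        return prompt.high_level
    return prompt.low_level


if __name__ == "__main__":
    example = ProgressivePrompt(
        high_level="a lighthouse on a rocky coast",
        low_level="soft evening light, oil-painting style",
    )
    for index in range(4):
        print(index, prompt_for_block(example, index, num_blocks=4))
```

A hard boundary keeps the sketch short; a real pipeline could just as well assign prompts per block from a lookup table.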

Paper Structure

This paper contains 10 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Visual results of the proposed approach. Our method can generate images with more detail, such as lighting and objects.
  • Figure 2: Details of the proposed region-attention. We present the control pipeline for two regions. Different prompts guide specific areas, and the results are then fused into the final representation.
  • Figure 3: Details of our proposed block. The figure shows the modified DiT block. To improve the fidelity of the results, we design the Controllable Region-Attention to achieve more accurate control.
  • Figure 4: The process of prompt generation. We show the progressive prompts, including high-level prompts and low-level prompts.
  • Figure 5: Comparison results. We compare our pipeline with SDXL and SD-1.5. The low-level prompts are shown on the left of the figure. We run the experiment in three settings: two, four, and nine chunks. Our pipeline obtains better performance.
  • ...and 1 more figure
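
The region-attention idea in the Figure 2 caption (one prompt per region, then fusion) and the chunked settings in the Figure 5 caption can be sketched roughly as follows. This is an illustrative NumPy sketch, not the paper's implementation. The function names are hypothetical, the rectangular grid layout of the chunks is an assumption, and the attention omits learned query, key, and value projections for brevity.

```python
import numpy as np


def grid_region_masks(height, width, rows, cols):
    """Split a height x width token grid into rows * cols rectangular regions.

    Returns a boolean array of shape (rows * cols, height * width) in which
    each token belongs to exactly one region.
    """
    row_edges = np.linspace(0, height, rows + 1).astype(int)
    col_edges = np.linspace(0, width, cols + 1).astype(int)
    masks = []
    for r in range(rows):
        for c in range(cols):
            mask = np.zeros((height, width), dtype=bool)
            mask[row_edges[r]:row_edges[r + 1], col_edges[c]:col_edges[c + 1]] = True
            masks.append(mask.reshape(-1))
    return np.stack(masks)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, text_states):
    """Scaled dot-product attention from image tokens (N, d) to text states (T, d).

    Simplification: text states serve as both keys and values, with no
    learned projections.
    """
    scores = image_tokens @ text_states.T / np.sqrt(image_tokens.shape[-1])
    return softmax(scores) @ text_states


def region_cross_attention(image_tokens, region_text_states, masks):
    """Attend to each region's prompt separately, then fuse by region mask."""
    fused = np.zeros_like(image_tokens)
    for text_states, mask in zip(region_text_states, masks):
        attended = cross_attention(image_tokens, text_states)
        fused[mask] = attended[mask]  # keep only the tokens inside this region
    return fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    masks = grid_region_masks(height=4, width=4, rows=2, cols=2)  # four chunks
    tokens = rng.normal(size=(16, 8))
    prompts = [rng.normal(size=(5, 8)) for _ in range(len(masks))]
    print(region_cross_attention(tokens, prompts, masks).shape)
```

Under the grid assumption, the two-, four-, and nine-chunk settings could correspond to 1×2, 2×2, and 3×3 grids, although the source does not state the layouts. Computing attention for every token and then masking is wasteful but keeps the fusion step easy to read.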