Training-free Regional Prompting for Diffusion Transformers

Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang

TL;DR

Long and spatially structured prompts challenge prompt following in diffusion models. This work introduces a training-free regional prompting method for the Diffusion Transformer FLUX.1 that modulates attention via region masks to achieve fine-grained, region-specific image synthesis. It defines a region-aware attention mechanism with separate regional and base latents blended by a beta parameter, enabling multi-region compositional generation without retraining. Empirical results show competitive prompt adherence with reduced inference cost compared to training-based regional controls and compatibility with plug-in modules like LoRAs and ControlNet.
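The blending step mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the blend is a plain linear interpolation in which beta weights the base latent (Figure 5 calls beta the "base ratio"), and the function name `blend_latents` and the flat-list latent representation are hypothetical.

```python
def blend_latents(base, regional, beta):
    """Blend two latents element-wise.

    Assumed convention (not confirmed by the source): beta is the weight of
    the base latent, so the result is beta * base + (1 - beta) * regional.
    Latents are represented as flat lists of floats for illustration only.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if len(base) != len(regional):
        raise ValueError("base and regional latents must have the same length")
    return [beta * b + (1.0 - beta) * r for b, r in zip(base, regional)]


# Hypothetical toy latents with two elements each.
base_latent = [1.0, 2.0]
regional_latent = [3.0, 4.0]
blended = blend_latents(base_latent, regional_latent, 0.5)
```

Under this assumed convention, beta = 1 returns the base latent unchanged and beta = 0 returns the purely regional latent, so intermediate values trade global coherence against regional control.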

Abstract

Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt-following) ability has also been greatly improved by large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which equips DiT with fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.

Paper Structure

This paper contains 7 sections, 7 equations, 6 figures.

Figures (6)

  • Figure 1: Overview of our method. Given user-defined or LLM-generated regional prompt-mask pairs, we can effectively achieve fine-grained compositional text-to-image generation.
  • Figure 2: Main results. Simplified regional prompts are colored according to the layout mask. In practice, we input a more detailed regional prompt for each region.
  • Figure 3: Illustration of our Region-Aware Attention Manipulation module. The unified self-attention in FLUX can be broken down into four parts: cross-attention from image to text, cross-attention from text to image, self-attention among image tokens, and self-attention among text tokens. After calculating the attention manipulation mask for each part, we merge them into the overall attention mask that is later fed into the attention calculation.
  • Figure 4: Results with LoRAs and ControlNet. Colored prompts and masks are provided for the regional control in each example. The control image (pose and depth map) for ControlNet is shown as an inset in the left image. Zoom in to see details.
  • Figure 5: Ablation results with base ratio $\beta$, control steps $T$ and control blocks $B$.
  • ...and 1 more figure
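
The block decomposition described in the Figure 3 caption can be illustrated with a small mask builder. This is a sketch under stated assumptions rather than the paper's implementation: the token order (all regional text tokens first, then flattened image tokens), the specific rules for each block, and the name `build_region_attention_mask` are hypothetical choices made for illustration.

```python
def build_region_attention_mask(text_lens, region_masks):
    """Build a boolean attention mask over [text_1 .. text_K, image] tokens.

    text_lens:    number of text tokens in each regional prompt.
    region_masks: one list of bools per region over the flattened image
                  tokens (True where the token lies inside that region).
    Returns a square list-of-lists; True means "row token may attend to
    column token". All rules below are illustrative assumptions.
    """
    if len(text_lens) != len(region_masks) or not region_masks:
        raise ValueError("need one non-empty region mask per regional prompt")
    num_img = len(region_masks[0])
    if any(len(m) != num_img for m in region_masks):
        raise ValueError("all region masks must cover the same image tokens")

    # Index of the region that owns each text token.
    text_owner = [k for k, n in enumerate(text_lens) for _ in range(n)]
    num_txt = len(text_owner)
    total = num_txt + num_img
    mask = [[False] * total for _ in range(total)]

    # Text-to-text: tokens attend only within their own regional prompt.
    for i in range(num_txt):
        for j in range(num_txt):
            mask[i][j] = text_owner[i] == text_owner[j]

    # Text-to-image and image-to-text: a regional prompt and the image
    # tokens inside its region attend to each other.
    for i in range(num_txt):
        region = region_masks[text_owner[i]]
        for p in range(num_img):
            mask[i][num_txt + p] = region[p]
            mask[num_txt + p][i] = region[p]

    # Image-to-image: tokens attend when they share a region; every token
    # may attend to itself so that no row is left empty.
    for p in range(num_img):
        for q in range(num_img):
            shared = any(m[p] and m[q] for m in region_masks)
            mask[num_txt + p][num_txt + q] = shared or p == q

    return mask


# Toy layout: two regions splitting four image tokens into left and right.
toy_mask = build_region_attention_mask(
    text_lens=[1, 2],
    region_masks=[[True, True, False, False], [False, False, True, True]],
)
```

In a real pipeline such a mask would typically be converted to a tensor and passed to the attention call; here it is kept as plain Python lists so the four blocks are easy to inspect.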