Training-free Regional Prompting for Diffusion Transformers
Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang
TL;DR
Long and spatially structured prompts challenge prompt following in diffusion models. This work introduces a training-free regional prompting method for the Diffusion Transformer FLUX.1 that modulates attention via region masks to achieve fine-grained, region-specific image synthesis. It defines a region-aware attention mechanism with separate regional and base latents blended by a beta parameter, enabling multi-region compositional generation without retraining. Empirical results show competitive prompt adherence with reduced inference cost compared to training-based regional controls and compatibility with plug-in modules like LoRAs and ControlNet.
Abstract
Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which gives DiT fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.
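To make the idea of mask-based attention manipulation concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes a joint token sequence ordered as [text tokens of all regional prompts, image tokens], lets each image token attend only to the text tokens of its own region and to image tokens in the same region, and blends a regional latent with a base latent using a weight `beta`. The function names (`build_region_attention_mask`, `masked_attention`, `blend_latents`), the exact masking rules, and the direction of the `beta` blend are illustrative assumptions; the actual method may differ.

```python
import numpy as np


def build_region_attention_mask(region_masks, text_lens):
    """Build a boolean attention mask (True = attention allowed).

    region_masks: list of (H, W) boolean arrays, one per regional prompt.
    text_lens:    number of text tokens for each regional prompt.
    Token order (an assumption of this sketch): [text tokens..., image tokens...].
    """
    n_img = region_masks[0].size
    n_txt = sum(text_lens)
    total = n_txt + n_img
    allowed = np.zeros((total, total), dtype=bool)

    start = 0
    for mask, length in zip(region_masks, text_lens):
        txt_idx = np.arange(start, start + length)
        img_idx = n_txt + np.flatnonzero(mask.reshape(-1))
        # Text tokens of a regional prompt attend to each other.
        allowed[np.ix_(txt_idx, txt_idx)] = True
        # Image tokens in the region and the region's text attend to each other.
        allowed[np.ix_(img_idx, txt_idx)] = True
        allowed[np.ix_(txt_idx, img_idx)] = True
        # Image tokens attend to image tokens in the same region.
        allowed[np.ix_(img_idx, img_idx)] = True
        start += length

    # Guard: every token may attend to itself, so no softmax row is empty
    # even if an image token is covered by no region.
    np.fill_diagonal(allowed, True)
    return allowed


def masked_attention(q, k, v, allowed):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def blend_latents(regional, base, beta):
    """Linear blend; here beta weights the regional latent (an assumption)."""
    return beta * regional + (1.0 - beta) * base
```

In this sketch, a larger `beta` gives the regional latent more weight and a smaller one keeps the result closer to the base latent. In practice the same boolean mask could be passed to an attention kernel that accepts one, rather than applied by hand as above.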
