Towards Better Text-to-Image Generation Alignment via Attention Modulation
Yihang Wu, Xiao Cao, Kaixin Li, Zitan Chen, Haonan Wang, Lei Meng, Zhiyong Huang
TL;DR
This work addresses entity leakage and attribute misalignment in text-to-image generation when prompts describe multiple entities. It introduces a training-free, phase-wise attention-control framework with three parts: a temperature parameter that modulates self-attention, an object-focused masking mechanism for cross-attention, and phase-wise reweighting that prioritizes different semantic components at different diffusion steps. The approach yields better image-text alignment with minimal computational overhead. It is validated on COCO with Stable Diffusion XL as the baseline, using qualitative, quantitative, and semi-human assessments plus ablations. The results show robust improvements for compositional prompts; the work also acknowledges limitations in nested or overlapping attribute scenarios and outlines directions for future work.
Abstract
In text-to-image generation, advances in diffusion models have improved the fidelity of generated results. However, these models struggle with text prompts that contain multiple entities and attributes: an uneven distribution of attention leads to entity leakage and attribute misalignment. Training from scratch to address this issue requires large amounts of labeled data and is resource-intensive. Motivated by this, we propose an attribution-focusing mechanism, a training-free, phase-wise mechanism that modulates attention in diffusion models. One of our core ideas is to guide the model to concentrate on the corresponding syntactic components of the prompt at distinct timesteps. To achieve this, we incorporate a temperature control mechanism into the self-attention modules during the early phases to mitigate entity leakage. An object-focused masking scheme and a phase-wise dynamic weight control mechanism are integrated into the cross-attention modules, enabling the model to discern the affiliation of semantic information between entities more effectively. Experimental results in various alignment scenarios demonstrate that our model attains better image-text alignment with minimal additional computational cost.
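The two attention controls named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the list-based scores, and the use of negative infinity for masked tokens are all assumptions made for illustration, and the abstract does not give the exact formulation or the temperature values used. The sketch only shows the general idea that dividing attention scores by a temperature changes how concentrated the softmax is, and that masking removes the influence of prompt tokens that do not belong to a given object.

```python
import math


def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax over a list of raw attention scores.

    A temperature below 1 sharpens the distribution; above 1 flattens it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_attention(scores, token_mask, temperature=1.0):
    """Softmax over prompt tokens with disallowed tokens suppressed.

    token_mask[i] is True when prompt token i may influence this image
    region (hypothetical convention, not taken from the paper).
    """
    if not any(token_mask):
        raise ValueError("at least one token must be allowed")
    masked = [s if keep else float("-inf")
              for s, keep in zip(scores, token_mask)]
    return softmax(masked, temperature)


# Illustrative scores of one query against three keys.
scores = [2.0, 1.0, 0.0]
sharp = softmax(scores, temperature=0.5)
plain = softmax(scores, temperature=1.0)
masked = masked_cross_attention(scores, [True, False, True])
```

Here `sharp` places more weight on the highest-scoring key than `plain` does, and `masked` assigns zero weight to the second token while the remaining weights still sum to one.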
