Towards Better Text-to-Image Generation Alignment via Attention Modulation

Yihang Wu; Xiao Cao; Kaixin Li; Zitan Chen; Haonan Wang; Lei Meng; Zhiyong Huang

TL;DR

This work addresses entity leakage and attribute misalignment in text-to-image generation when prompts describe multiple entities. It introduces a training-free phase-wise attention-control framework that modulates self-attention via a temperature parameter, cross-attention via an object-focused masking mechanism, and phase-wise reweighting to prioritize semantic components at different diffusion steps. The approach yields better image-text alignment with minimal computational overhead, validated on COCO with Stable Diffusion XL as baseline, using qualitative, quantitative, and semi-human assessments plus ablations. It demonstrates robust improvements for compositional prompts while acknowledging limitations in nested or overlapping attribute scenarios and outlining directions for future work.

Abstract

In text-to-image generation, advances in diffusion models have improved the fidelity of generated results. However, these models struggle with text prompts containing multiple entities and attributes: the uneven distribution of attention leads to entity leakage and attribute misalignment. Training from scratch to address this issue requires large amounts of labeled data and is resource-intensive. Motivated by this, we propose an attribution-focusing mechanism, a training-free, phase-wise mechanism that modulates attention in diffusion models. One of our core ideas is to guide the model to concentrate on the corresponding syntactic components of the prompt at distinct timesteps. To achieve this, we incorporate a temperature control mechanism in the early phases of the self-attention modules to mitigate entity leakage. An object-focused masking scheme and a phase-wise dynamic weight control mechanism are integrated into the cross-attention modules, enabling the model to discern the affiliation of semantic information between entities more effectively. Experimental results in various alignment scenarios demonstrate that our model attains better image-text alignment with minimal additional computational cost.
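
As a rough illustration of the temperature control idea described above, the following NumPy sketch sharpens the self-attention distribution during the early denoising steps only. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `self_attention_probs`, the parameters `temperature` and `early_fraction`, their default values, and the choice to divide the logits by the temperature are all hypothetical and are not taken from the source.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_probs(q, k, step, total_steps,
                         temperature=0.5, early_fraction=0.3):
    """Self-attention probabilities with phase-wise temperature control.

    ASSUMPTIONS (illustrative only, not from the paper):
      * the temperature divides the scaled dot-product logits,
      * a temperature below 1 is used to sharpen the distribution,
      * "early phase" means the first `early_fraction` of all steps.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)            # standard scaled dot-product
    if step < early_fraction * total_steps:  # early phase only
        logits = logits / temperature        # temperature < 1 sharpens rows
    return softmax(logits, axis=-1)


# Toy example: 16 image patches with 8-dimensional queries and keys.
rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
k = rng.normal(size=(16, 8))

early = self_attention_probs(q, k, step=0, total_steps=50)   # modulated
late = self_attention_probs(q, k, step=40, total_steps=50)   # unmodified
```

In this toy setup, each row of `early` concentrates more probability mass on its strongest patches than the corresponding row of `late`, which is one way to read the "more confined" high-response region that the temperature control is meant to produce.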

Paper Structure (15 sections, 7 equations, 7 figures, 3 tables)

This paper contains 15 sections, 7 equations, 7 figures, 3 tables.

Introduction
Related Work
Proposed Method
Self-Attention Control
Cross-Attention Control
Object-focused masking mechanism
Phase-wise Dynamic Reweighting
Experiments
Qualitative Analysis
Quantitative Analysis
Benchmark Evaluation
Semi-human Evaluation for Alignment Performance
Ablation Experiment
Discussion
Conclusion

Figures (7)

Figure 1: The overall pipeline of our method. In the self-attention modules, we employ a temperature control strategy to better construct the outlines of the entities. In the cross-attention layers, we integrate an object-focused masking mechanism and a dynamic reweighting mechanism to emphasize different components of the prompt at various stages.
Figure 2: A display of the effects of self-attention temperature control on the self-attention map. Given the prompt "a boy in front of a female", we visualized the attention values between a patch within the highlighted red box in figures (b) and (d) and other patches throughout the diffusion process. After applying temperature control, the patch's high response region became more confined, thereby forming more accurate outlines.
Figure 3: A comparison of the cross-attention maps of each token in the original model and in our method at timestep 30. Given the prompt "a boy in front of a female", we observe that in the original model the semantic information of some tokens is spread across the entire image, resulting in poor alignment. With our control method, the token information corresponding to entities becomes more aggregated, ultimately yielding better generative results.
Figure 4: Attention maps at different timesteps in the original model and our method. We visualized the distribution of attention maps at different timesteps given the prompt "a boy in front of a female". Compared to the original model, with our control the attention maps construct the outlines of entities earlier. This is particularly evident in the third and fourth timestep stages, where the dynamic reweighting mechanism allows for earlier and better differentiation between the image background and the entities.
Figure 5: Qualitative results for given prompts. In our qualitative experiments, we conducted extensive tests on Stable Diffusion XL, Structured Diffusion, and our method. The results indicate that our approach generates images that are closer to the prompt and align more accurately with it in terms of quantity, attributes, and objects.
...and 2 more figures
