
Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models

Yang Zhang, Teoh Tze Tzun, Lim Wei Hern, Tiviatis Sim, Kenji Kawaguchi

TL;DR

This work introduces attention regulation, a computationally efficient on-the-fly optimization performed at inference time that aligns cross-attention maps with the input text prompt, and compares it against alternative approaches across various datasets, evaluation metrics, and diffusion models.

Abstract

Recent advancements in diffusion models have notably improved the perceptual quality of generated images in text-to-image synthesis tasks. However, diffusion models often struggle to produce images that accurately reflect the intended semantics of the associated text prompts. We examine cross-attention layers in diffusion models and observe that these layers tend to focus disproportionately on certain tokens during the generation process, thereby undermining semantic fidelity. To address this issue of dominant attention, we introduce attention regulation, a computationally efficient on-the-fly optimization approach applied at inference time to align attention maps with the input text prompt. Notably, our method requires no additional training or fine-tuning and serves as a plug-in module for an existing model. Hence, the generation capacity of the original model is fully preserved. We compare our approach with alternative approaches across various datasets, evaluation metrics, and diffusion models. Experimental results show that our method consistently outperforms the baselines, yielding images that more faithfully reflect the desired concepts with reduced computational overhead. Code is available at https://github.com/YaNgZhAnG-V5/attention_regulation.
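The abstract describes the method only at a high level, and the actual objective is not given in this excerpt. As a rough illustration of what an on-the-fly optimization at inference time can look like, the sketch below runs gradient ascent on a per-token bias added to toy cross-attention scores, leaving the scores themselves (standing in for the frozen model) untouched. Everything in it is an assumption made for illustration: the function names, the toy objective, the `l2_weight` penalty, and the example scores are hypothetical and are not taken from the paper or its repository.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def mean_share(logits, bias):
    """Average attention per token over all image locations."""
    probs = [softmax([z + b for z, b in zip(row, bias)]) for row in logits]
    return [sum(p[k] for p in probs) / len(probs) for k in range(len(bias))]


def regulate(logits, targets, steps=50, lr=0.5, l2_weight=0.1):
    """Gradient ascent on a per-token bias; the logits are never changed.

    Assumed toy objective (not the paper's):
        mean over locations of sum_{t in targets} log p_t
        minus l2_weight * ||bias||^2
    """
    n_tokens = len(logits[0])
    bias = [0.0] * n_tokens
    for _ in range(steps):
        grad = [0.0] * n_tokens
        for row in logits:
            p = softmax([z + b for z, b in zip(row, bias)])
            for k in range(n_tokens):
                hit = 1.0 if k in targets else 0.0
                # d/d bias_k of sum_t log p_t = [k in targets] - |targets| * p_k
                grad[k] += (hit - len(targets) * p[k]) / len(logits)
        # Ascent step, including the gradient of the L2 penalty on the bias.
        bias = [b + lr * (g - 2.0 * l2_weight * b) for b, g in zip(bias, grad)]
    return bias


# Hypothetical scores for the tokens ["elephant", "with", "glasses"];
# one row per image location.
logits = [[3.0, 0.0, 0.0], [4.0, 0.0, 0.5]]
targets = {0, 2}  # indices of "elephant" and "glasses"

before = mean_share(logits, [0.0, 0.0, 0.0])
bias = regulate(logits, targets)
after = mean_share(logits, bias)
```

In this toy setting the share of attention on "glasses" rises and the share on "elephant" falls, while setting the bias back to zero recovers the original attention, which mirrors the plug-in behaviour described in the abstract.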

Paper Structure (12 sections, 13 equations, 8 figures, 3 tables)


Figures (8)

  • Figure 1: Attention regulation improves semantic alignment with prompts by modifying the cross-attention maps at inference time, without fine-tuning the model. It requires only additional information on the target tokens and achieves inference time comparable to that of the original model. Attention regulation serves as a plug-in module and can be disabled at any time to use the original model.
  • Figure 2: Illustration of attention dominance. The violin plots display the attention statistics for one cross-attention layer across two image samples, both prompted by "A painting of an elephant with glasses." At the initial diffusion step $0$ (middle column), the attention patterns are similar for both samples. By step $24$ (third column), a significant divergence is evident. For the successful sample (bottom row), the attention allocated to "elephant" and "glasses" is approximately equal, suggesting a balanced representation. In contrast, for the sample that fails to include glasses (top row), attention disproportionately favors the token "elephant," marginalizing other relevant tokens (red arrow). More results are in the Appendix.
  • Figure 3: Visualization of the optimization outcome. The given prompt is "A bedroom with a book on the bed". By creating regions with high attention values for the target tokens while maintaining the consistency of the attention maps across diffusion steps, the desired targets are successfully generated.
  • Figure 4: A qualitative comparison of the images generated by previous approaches and our approach. More samples are in the Appendix.
  • Figure 5: Ablation study on the layers and diffusion steps at which attention regulation is performed, and on the $\beta$ regularisation term. Performance initially improves as more layers and diffusion steps are edited, but saturates at edit layer $4$ and $25$ edit steps. Performance improves as $\beta$ increases, up to $0.1$.
  • ...and 3 more figures
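
The attention dominance described in the Figure 2 caption can be made concrete with a small calculation. The sketch below is an assumption-laden illustration, not the paper's measurement code: it takes hypothetical cross-attention scores (one row per image location, one column per prompt token), applies a softmax over tokens, and averages the result to obtain each token's share of attention. The function names and the numbers are invented for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_attention_share(logits):
    """Mean attention per prompt token, averaged over image locations.

    logits: list of rows, one per image location; each row holds one
    score per prompt token.
    """
    probs = [softmax(row) for row in logits]
    n_tokens = len(logits[0])
    return [sum(p[j] for p in probs) / len(probs) for j in range(n_tokens)]


# Hypothetical scores for the tokens ["elephant", "glasses"].
balanced = [[1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
dominated = [[3.0, 0.0], [4.0, 0.5], [2.5, 0.0]]

share_balanced = token_attention_share(balanced)
share_dominated = token_attention_share(dominated)
```

With these made-up scores, the balanced case splits attention evenly between the two tokens, while in the dominated case "elephant" takes well over 90% of the attention, which is the kind of imbalance the caption associates with the missing glasses.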