Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

Omer Dahary, Or Patashnik, Kfir Aberman, Daniel Cohen-Or

TL;DR

This work targets the challenge of faithfully generating images containing multiple similar subjects by identifying semantic leakage in attention mechanisms as a key bottleneck. It introduces Bounded Attention, a training-free method that bounds information flow in cross- and self-attention during both guidance and denoising, guided by input bounding boxes and refined via mask clustering. Across SD and SDXL, the approach yields improved layout fidelity and subject individuality, with quantitative gains on DrawBench and supportive user studies. The method enables complex, multi-subject prompts to be realized more faithfully, offering a practical tool for layout-controlled diffusion without additional training.

Abstract

Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model's attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
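
To make the idea of bounding information flow concrete, here is a minimal NumPy sketch of one way attention between subjects could be masked. It is an illustration under stated assumptions, not the authors' implementation: the function names `box_mask` and `bounded_self_attention`, the normalized `(x0, y0, x1, y1)` box format, and the rule "a pixel inside one subject's box may not attend to pixels that belong only to another subject's box" are all hypothetical choices made for this example.

```python
import numpy as np

def box_mask(h, w, box):
    """Flattened boolean mask of an h x w grid for a box (x0, y0, x1, y1) in [0, 1].

    The box format is an assumption made for this sketch.
    """
    x0, y0, x1, y1 = box
    m = np.zeros((h, w), dtype=bool)
    m[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = True
    return m.ravel()

def bounded_self_attention(q, k, v, subject_masks):
    """Single-head self-attention with cross-subject entries masked out.

    q, k, v: arrays of shape (n_pixels, dim).
    subject_masks: list of boolean arrays of shape (n_pixels,), one per subject.
    Hypothetical masking rule: queries inside subject i ignore keys that lie
    in subject j's box (and outside subject i's own box).
    """
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    allowed = np.ones((n, n), dtype=bool)
    for i, mi in enumerate(subject_masks):
        for j, mj in enumerate(subject_masks):
            if i != j:
                allowed[np.ix_(mi, mj & ~mi)] = False
    # Masked entries get -inf so they receive zero weight after the softmax.
    logits = np.where(allowed, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

# Toy usage: two subjects occupying the left and right halves of a 4 x 4 grid.
rng = np.random.default_rng(0)
h = w = 4
left = box_mask(h, w, (0.0, 0.0, 0.5, 1.0))
right = box_mask(h, w, (0.5, 0.0, 1.0, 1.0))
q, k, v = (rng.normal(size=(h * w, 8)) for _ in range(3))
out, weights = bounded_self_attention(q, k, v, [left, right])
```

In this toy setup, every attention weight from a left-box pixel to a right-box pixel (and vice versa) is exactly zero, so features of one subject cannot be mixed into the other through this layer. Each pixel can always attend to itself, so no row of the softmax is fully masked.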

Paper Structure

This paper contains 37 sections, 6 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Our method bounds the attention to enable layout control over a pre-trained text-to-image diffusion model. Bounded Attention effectively reduces the impact of the innate semantic leakage during denoising, encouraging each subject to be itself. Our method can faithfully generate challenging layouts featuring multiple similar subjects with different modifiers (e.g., ginger and gray kittens).
  • Figure 2: Misalignments in layout-to-image generation include (i) catastrophic neglect [chefer2023attend], where the model fails to include one or more subjects mentioned in the prompt within the generated image, (ii) incorrect attribute binding [chefer2023attend, rassin2023linguistic], where attributes are not correctly matched to their corresponding subjects, and (iii) subject fusion [zhao2023loco], where the model merges multiple subjects into a single, larger subject.
  • Figure 3: Cross-Attention Leakage. We demonstrate the emergence of semantic leakage at the cross-attention layers. We show two examples: a puppy and a kitten, and a hamster and a squirrel. In the two leftmost columns, the subjects were generated separately using Stable Diffusion (SD). In the right three columns, we generate a single image with the two subjects using three different methods: Stable Diffusion (SD), Layout Guidance (LG), and Bounded Attention (BA, ours). Under each row, we plot the first two principal components of the cross-attention queries. As can be seen, the separation of the queries (blue and red) reflects the leakage between the subjects in the generated images.
  • Figure 4: Self-Attention Leakage. We demonstrate the emergence of semantic leakage at the self-attention maps of two subjects: a crab and a frog. The images are generated by Stable Diffusion (SD) and Layout Guidance (LG). The top row highlights specific pixels, such as those of a subject's eye or leg, while the bottom row presents their respective self-attention maps.
  • Figure 5: We generate different subjects, and plot the first two principal components of the cross-attention queries at different layers of the UNet, where each layer is of different resolution. The high semantic similarity between the kitten and the puppy is expressed by the proximity of their queries through all layers. Meanwhile, the lizard and fruit share similar texture, and hence only their high-resolution queries are entangled.
  • ...and 12 more figures