Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models!

Arash Marioriyad, Mohammadali Banayeeanzade, Reza Abbasi, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

TL;DR

The paper investigates why entity missing occurs in text-to-image diffusion and identifies cross-attention overlap between entity prompts as the root cause. It introduces four training-free overlap-based losses—IoU, $D_{CoM}$, $D_{KL}$, and $CC$—to minimize overlap during denoising, updating latent codes without retraining. Across COCO-Comp, T2I-CompBench, HRS-Bench, and COCO captions, these methods significantly improve compositional accuracy and human/VQA/CLIP metrics, with CoM Distance often delivering the strongest gains and only minor quality trade-offs as measured by FID, Coverage, and Density. The work demonstrates robust performance across backbones (SD 1.4/2/XL) and prompts of varying complexity, validating a practical, plug-in strategy for enhancing compositional generation without expensive fine-tuning. It also highlights potential directions to improve text encoding and evaluation biases, aiming to broaden the applicability of reliable, entity-faithful image synthesis.

Abstract

Text-to-image diffusion models, such as Stable Diffusion and DALL-E, are capable of generating high-quality, diverse, and realistic images from textual prompts. However, they sometimes struggle to accurately depict specific entities described in prompts, a limitation known as the entity missing problem in compositional generation. While prior studies suggested that adjusting cross-attention maps during the denoising process could alleviate this problem, they did not systematically investigate which objective functions could best address it. This study examines three potential causes of the entity-missing problem, focusing on cross-attention dynamics: (1) insufficient attention intensity for certain entities, (2) overly broad attention spread, and (3) excessive overlap between attention maps of different entities. We found that reducing overlap in attention maps between entities can effectively minimize the rate of entity missing. Specifically, we hypothesize that tokens related to specific entities compete for attention on certain image regions during the denoising process, which can lead to divided attention across tokens and prevent accurate representation of each entity. To address this issue, we introduced four loss functions, Intersection over Union (IoU), center-of-mass (CoM) distance, Kullback-Leibler (KL) divergence, and clustering compactness (CC) to regulate attention overlap during denoising steps without the need for retraining. Experimental results across a wide variety of benchmarks reveal that these proposed training-free methods significantly improve compositional accuracy, outperforming previous approaches in visual question answering (VQA), captioning scores, CLIP similarity, and human evaluations. Notably, these methods improved human evaluation scores by 9% over the best baseline, demonstrating substantial improvements in compositional alignment.
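To make the idea of attention overlap concrete, the following is a minimal sketch of three of the overlap measures named in the abstract (IoU, CoM distance, and KL divergence), computed on two toy attention maps. The exact formulations used in the paper are not given in this summary, so the definitions below (a soft IoU built from elementwise minima and maxima, a Euclidean distance between attention-weighted centroids, and a symmetric KL divergence) are generic assumptions rather than the authors' losses. The helper names and the Gaussian test maps are hypothetical, and clustering compactness (CC) is omitted because its definition does not appear here.

```python
import numpy as np

def normalize(attn, eps=1e-12):
    """Scale a non-negative attention map so its entries sum to (about) 1."""
    attn = np.asarray(attn, dtype=np.float64)
    return attn / (attn.sum() + eps)

def soft_iou(p, q, eps=1e-12):
    """Assumed soft IoU: sum of elementwise minima over sum of elementwise maxima."""
    return float(np.minimum(p, q).sum() / (np.maximum(p, q).sum() + eps))

def center_of_mass(p):
    """Attention-weighted mean (row, col) coordinate of a normalized map."""
    rows, cols = np.indices(p.shape)
    return np.array([(rows * p).sum(), (cols * p).sum()])

def com_distance(p, q):
    """Euclidean distance between the two centers of mass."""
    return float(np.linalg.norm(center_of_mass(p) - center_of_mass(q)))

def symmetric_kl(p, q, eps=1e-12):
    """Assumed symmetric KL divergence between two normalized maps."""
    p, q = p + eps, q + eps
    return float(0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))))

def gaussian_blob(size, center, sigma=2.0):
    """Hypothetical stand-in for an entity's cross-attention map."""
    rows, cols = np.indices((size, size))
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return normalize(np.exp(-d2 / (2.0 * sigma ** 2)))

# Toy maps: one "cat" map, and a "vase" map placed near it or far from it.
cat = gaussian_blob(16, (8, 5))
vase_near = gaussian_blob(16, (8, 7))
vase_far = gaussian_blob(16, (8, 12))

for name, vase in [("near", vase_near), ("far", vase_far)]:
    print(name,
          "IoU:", round(soft_iou(cat, vase), 3),
          "CoM distance:", round(com_distance(cat, vase), 3),
          "KL:", round(symmetric_kl(cat, vase), 3))
```

Under these assumed definitions, moving the two maps apart lowers the IoU and raises both the CoM distance and the KL divergence, which is the direction an overlap-reducing objective would push the attention maps during denoising.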

Paper Structure

This paper contains 48 sections, 16 equations, 14 figures, 17 tables.

Figures (14)

  • Figure 1: Comparison of compositional generation capabilities between Stable Diffusion, Attend-and-Excite, and one of our proposed overlap-based methods (CoM Distance) for textual prompts containing two and three entities while using SD-1.4 as the backbone: Stable Diffusion (first row) and Attend-and-Excite (second row) often fail to generate all the specified entities in the input prompt, a problem known as entity missing. Our training-free approach (third row) addresses this issue by employing an overlap-based objective function (CoM Distance) on cross-attention maps during the denoising steps, resulting in a more faithful generation of all the entities mentioned in the input prompt.
  • Figure 2: Left: Three attention maps of entity "bird" are drawn at different time steps (0, 25, and 50) along with the generated image for both success and failure cases. Right: While in both success and failure cases, the attention intensity (Equation \ref{eq:intensity}) of entity "bird" decreased over time, the bad initialization of $z_T$ with $seed=12$ resulted in the vanishing of the attention scores of "bird" at $t=50$.
  • Figure 3: Left: Three attention maps of entity "snowboard" are drawn at different time steps (0, 25, and 50) along with the generated image for both success and failure cases. Right: During most of the time steps, the attention spread (Equation \ref{eq:variance}) of entity "snowboard" in the failure case is much higher than in the success case, resulting in the entity missing problem.
  • Figure 4: Left: For both the success and the failure cases and at three time steps (0, 25, and 50), the attention maps of entities "cat" and "vase" are depicted together in one image, with the attention scores of "cat" and "vase" shown in red and blue, respectively. Right: During most of the time steps, the attention overlap measured by the $IoU$ metric (Table \ref{table:overlap_based_metrics}) between the two entities in the failure case is much higher than that in the success case, resulting in the missing entity "cat" in the generated image.
  • Figure 5: Left: The relationships between the proposed metrics and the VQA score: Higher values of $CoM$ distance, $KL$ divergence, $CC$, and attention intensity are associated with a higher VQA score, while lower values of $IoU$ and attention spread ($Var$) increase the likelihood of faithfully generating entities. Right: The correlation matrix between the proposed metrics and the VQA score: The results reveal strong correlations between the overlap-based metrics ($IoU$, $CoM$ distance, $KL$ divergence, and $CC$) and the success rate of the model in faithfully generating the entities. In particular, the $CoM$ distance, $KL$ divergence, and $CC$ metrics exhibit strong positive correlations with the VQA score, while the $IoU$ metric shows a strong negative correlation. Moreover, each pair of overlap-based metrics is also strongly correlated.
  • ...and 9 more figures
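
The attention intensity and attention spread quantities referenced in the captions of Figures 2 and 3 can be illustrated with a small sketch. The equations themselves are not reproduced in this summary, so the definitions below are assumptions: intensity is taken as the total attention mass of a map, and spread as the attention-weighted spatial variance around the map's center of mass. The function names and the Gaussian test maps are hypothetical.

```python
import numpy as np

def attention_intensity(attn):
    """Assumed proxy for intensity: total attention mass of the map."""
    return float(np.asarray(attn, dtype=np.float64).sum())

def attention_spread(attn, eps=1e-12):
    """Assumed proxy for spread: weighted spatial variance around the center of mass."""
    attn = np.asarray(attn, dtype=np.float64)
    p = attn / (attn.sum() + eps)  # normalize so the weights sum to (about) 1
    rows, cols = np.indices(attn.shape)
    mu_r, mu_c = (rows * p).sum(), (cols * p).sum()
    return float((((rows - mu_r) ** 2 + (cols - mu_c) ** 2) * p).sum())

def blob(size, center, sigma):
    """Hypothetical unnormalized attention map shaped like a Gaussian bump."""
    rows, cols = np.indices((size, size))
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

focused = blob(16, (8, 8), sigma=1.5)  # tight map, as in a success case
diffuse = blob(16, (8, 8), sigma=5.0)  # broad map, as in the Figure 3 failure case
faded = 0.1 * focused                  # same shape with weaker scores, as in Figure 2

print("spread focused vs diffuse:", attention_spread(focused), attention_spread(diffuse))
print("intensity focused vs faded:", attention_intensity(focused), attention_intensity(faded))
```

With these definitions the two quantities are independent of each other: scaling a map down lowers its intensity but leaves its spread unchanged, while widening a map raises its spread.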