Divide & Bind Your Attention for Improved Generative Semantic Nursing

Yumeng Li, Margret Keuper, Dan Zhang, Anna Khoreva

TL;DR

This work tackles semantic fidelity in text-to-image diffusion models by addressing missing objects and attribute misbinding, especially for complex prompts. It introduces Divide & Bind, an inference-time approach that applies a total-variation attendance loss to create multiple spatial excitations and a Jensen-Shannon binding loss to align attributes with their corresponding objects. The method builds on Generative Semantic Nursing to update latent codes during sampling without fine-tuning the model, yielding improved adherence to prompts across diverse benchmarks, particularly for multi-object descriptions. The authors acknowledge limitations with extremely rare attribute-object combinations and potential miscounting due to reliance on pretrained evaluators.
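
As a rough illustration of what "updating latent codes during sampling" means, one GSN step can be written as a single gradient step on the noisy latent. The notation here is an assumption, not taken from the text above: $z_t$ is the latent at timestep $t$, $\alpha_t$ a step size, and $\lambda$ a weight balancing the two losses.

$$
z_t' \leftarrow z_t - \alpha_t \, \nabla_{z_t} \left( L_{attend} + \lambda \, L_{bind} \right)
$$

The model weights stay fixed; only $z_t$ changes before the next denoising step.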

Abstract

Emerging large-scale text-to-image generative models, e.g., Stable Diffusion (SD), have exhibited overwhelming results with high fidelity. Despite the magnificent progress, current state-of-the-art models still struggle to generate images fully adhering to the input prompt. Prior work, Attend & Excite, has introduced the concept of Generative Semantic Nursing (GSN), aiming to optimize cross-attention during inference time to better incorporate the semantics. It demonstrates promising results in generating simple prompts, e.g., "a cat and a dog". However, its efficacy declines when dealing with more complex prompts, and it does not explicitly address the problem of improper attribute binding. To address the challenges posed by complex prompts or scenarios involving multiple entities and to achieve improved attribute binding, we propose Divide & Bind. We introduce two novel loss objectives for GSN: a novel attendance loss and a binding loss. Our approach stands out in its ability to faithfully synthesize desired objects with improved attribute alignment from complex prompts and exhibits superior performance across multiple evaluation benchmarks.
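
To make the two loss objectives more concrete, here is a minimal, dependency-free sketch of a total-variation attendance loss and a Jensen-Shannon binding loss on toy attention maps. The exact formulations are not given in the text above, so the details are assumptions: the attendance loss is taken as the negative of the smallest total variation among the object-token maps, and the binding loss as the JSD between the normalized attribute and object maps. All function names are hypothetical.

```python
import math


def total_variation(attn):
    """Anisotropic total variation of a 2D map given as a list of rows."""
    h, w = len(attn), len(attn[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:  # vertical neighbour difference
                tv += abs(attn[i + 1][j] - attn[i][j])
            if j + 1 < w:  # horizontal neighbour difference
                tv += abs(attn[i][j + 1] - attn[i][j])
    return tv


def attendance_loss(object_maps):
    """Assumed form: maximize the TV of the least-excited object token."""
    return -min(total_variation(m) for m in object_maps)


def normalize(attn):
    """Flatten a non-negative map and scale it to sum to 1."""
    flat = [v for row in attn for v in row]
    total = sum(flat)
    return [v / total for v in flat]


def kl_divergence(p, q):
    """KL(p || q), skipping entries where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def binding_loss(attr_map, obj_map):
    """Assumed form: JSD between normalized attribute and object maps."""
    return jsd(normalize(attr_map), normalize(obj_map))


if __name__ == "__main__":
    flat_map = [[0.25, 0.25], [0.25, 0.25]]   # no local excitation
    peaked_map = [[0.0, 1.0], [1.0, 0.0]]     # two separate excitations
    print(attendance_loss([flat_map, peaked_map]))
    print(binding_loss(peaked_map, peaked_map))
```

In an actual pipeline these quantities would be computed on the cross-attention maps with an autodiff framework so that gradients can flow back to the latent; the plain-Python version only illustrates the arithmetic.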

Paper Structure

This paper contains 28 sections, 5 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Our Divide & Bind can faithfully generate multiple objects based on a detailed textual description. Compared to the prior state-of-the-art semantic nursing technique for text-to-image synthesis, Attend & Excite (Chefer et al., 2023), our approach exhibits superior alignment with the input prompt and maintains a higher level of realism.
  • Figure 2: Method overview. We perform latent optimization on-the-fly based on the attention maps of the object tokens with our TV-based $L_{attend}$ and JSD-based $L_{bind}$.
  • Figure 3: Cross-attention visualization at different timesteps for each object token, together with the predicted clean image $\hat{x}_0^{(t)}$. Note that this figure is a GIF; a video version can be found at https://sites.google.com/view/divide-and-bind.
  • Figure 4: Binding loss ablation. $L_{bind}$ aligns the excitation of attribute and object attention.
  • Figure 5: Qualitative comparison in different settings with the same random seeds. Tokens used for optimization are highlighted in blue. Compared to others, Divide & Bind shows superior alignment with the input prompt while maintaining a high level of realism.
  • ...and 5 more figures