Divide & Bind Your Attention for Improved Generative Semantic Nursing
Yumeng Li, Margret Keuper, Dan Zhang, Anna Khoreva
TL;DR
This work tackles semantic fidelity in text-to-image diffusion models by addressing missing objects and attribute misbinding, especially for complex prompts. It introduces Divide & Bind, an inference-time approach that applies a total-variation attendance loss to create multiple spatial excitations and a Jensen-Shannon binding loss to align attributes with their corresponding objects. The method leverages Generative Semantic Nursing to update latent codes during sampling without fine-tuning the model, yielding improved adherence to prompts across diverse benchmarks, particularly for multi-object descriptions. The authors acknowledge limitations with extremely rare attribute-object combinations and potential miscounting due to reliance on pretrained evaluators.
Abstract
Emerging large-scale text-to-image generative models, e.g., Stable Diffusion (SD), have produced impressive, high-fidelity results. Despite this progress, current state-of-the-art models still struggle to generate images that fully adhere to the input prompt. Prior work, Attend & Excite, introduced the concept of Generative Semantic Nursing (GSN), which optimizes cross-attention at inference time to better incorporate the semantics of the prompt. It demonstrates promising results on simple prompts, e.g., "a cat and a dog". However, its efficacy declines on more complex prompts, and it does not explicitly address the problem of improper attribute binding. To address the challenges posed by complex prompts or scenarios involving multiple entities, and to achieve improved attribute binding, we propose Divide & Bind. We introduce two novel loss objectives for GSN: an attendance loss and a binding loss. Our approach stands out in its ability to faithfully synthesize desired objects with improved attribute alignment from complex prompts, and it exhibits superior performance across multiple evaluation benchmarks.
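To make the two loss objectives concrete, the following is a minimal sketch in plain Python. The TL;DR states only that the attendance loss is based on total variation and that the binding loss is based on Jensen-Shannon divergence; everything else here is an illustrative assumption rather than the authors' implementation. That includes the function names, the anisotropic form of the total variation, taking the minimum over object tokens, and normalizing each attention map to a probability distribution. Attention maps are represented as small nested lists with a positive sum.

```python
import math

def total_variation(attn):
    """Anisotropic total variation of a 2D attention map (list of rows)."""
    h, w = len(attn), len(attn[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:  # vertical neighbour difference
                tv += abs(attn[i + 1][j] - attn[i][j])
            if j + 1 < w:  # horizontal neighbour difference
                tv += abs(attn[i][j + 1] - attn[i][j])
    return tv

def attendance_loss(object_maps):
    """Assumed form: negative TV of the least-excited object token.

    Minimizing this raises the total variation of the weakest map,
    which encourages several local excitations instead of one peak.
    """
    return -min(total_variation(a) for a in object_maps)

def _normalize(attn):
    """Flatten a map and scale it to sum to 1 (assumes a positive sum)."""
    flat = [v for row in attn for v in row]
    total = sum(flat)
    return [v / total for v in flat]

def _kl(p, q):
    """KL divergence; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def binding_loss(object_map, attribute_map):
    """Assumed form: Jensen-Shannon divergence between the normalized
    attention maps of an object token and its attribute token."""
    p, q = _normalize(object_map), _normalize(attribute_map)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
```

In a GSN-style sampler, a weighted sum of these two terms would be differentiated with respect to the latent code at each denoising step. That step requires an automatic differentiation framework and is omitted here.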
