Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control

Maria Mihaela Trusca; Wolf Nuyts; Jonathan Thomm; Robert Honig; Thomas Hofmann; Tinne Tuytelaars; Marie-Francine Moens

TL;DR

Text-to-image (T2I) diffusion models often bind attributes to the wrong objects when a prompt mentions multiple entities. The authors introduce EPViT, a fine-grained image-text alignment predictor, and two training-free conditioning methods, FCA and DisCLIP, that improve object-attribute binding without retraining the underlying model. They also present DAA-200, a challenging evaluation benchmark, and show that EPViT is more accurate than CLIP at assessing binding, while FCA and DisCLIP yield consistent gains across several baselines and reduce attribute leakage. The results generalize to COCO-10K prompts and correlate with human judgments, suggesting practical value for more reliable and controllable T2I generation.

Abstract

Current diffusion models create photorealistic images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image. This is evidenced by our novel image-graph alignment model called EPViT (Edge Prediction Vision Transformer) for the evaluation of image-text alignment. To alleviate the above problem, we propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence. Additionally, the syntax structure of the prompt helps to disentangle the multimodal CLIP embeddings that are commonly used in T2I generation. The resulting DisCLIP embeddings and FCA are easily integrated in state-of-the-art diffusion models without additional training of these models. We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets. Code and data will be made available upon acceptance.
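The abstract describes FCA only at a high level, so the following is a rough illustration of the general idea of constraining attribute attention by syntax, not the authors' implementation. It assumes a toy cross-attention matrix `attn[p][t]` (image location `p`, prompt token `t`) and a hypothetical `attr_to_noun` mapping, which a syntactic parse of the prompt would supply. The mean-based threshold and the row renormalization are illustrative assumptions as well.

```python
from typing import Dict, List


def focus_attribute_attention(
    attn: List[List[float]],
    attr_to_noun: Dict[int, int],
) -> List[List[float]]:
    """Restrict each attribute token's attention to its noun's region.

    attn[p][t] is the attention weight of image location p on token t.
    attr_to_noun maps an attribute token index to the index of the noun
    it modifies (hypothetical input, e.g. derived from a parse tree).
    """
    num_tokens = len(attn[0])
    focused = [row[:] for row in attn]  # copy so the input stays untouched

    for attr_idx, noun_idx in attr_to_noun.items():
        noun_col = [row[noun_idx] for row in attn]
        # Assumption: a location belongs to the noun's region when its
        # attention on the noun is at least the mean over all locations.
        threshold = sum(noun_col) / len(noun_col)
        for p, weight in enumerate(noun_col):
            if weight < threshold:
                focused[p][attr_idx] = 0.0  # suppress attribute outside region

    # Assumption: renormalize each location's weights to sum to 1.
    for row in focused:
        total = sum(row)
        if total > 0:
            for t in range(num_tokens):
                row[t] /= total
    return focused


# Toy prompt tokens: 0="red", 1="car", 2="blue", 3="bench"
attn = [
    [0.4, 0.5, 0.1, 0.0],  # location 0: mostly "car"
    [0.3, 0.1, 0.1, 0.5],  # location 1: mostly "bench", but "red" leaks in
]
focused = focus_attribute_attention(attn, {0: 1, 2: 3})
```

In this toy case, "red" no longer receives attention at the bench location and "blue" no longer receives attention at the car location, which is the kind of attribute leakage the abstract refers to.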

Paper Structure (35 sections, 6 equations, 9 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Evaluation of T2I Generation
T2I Generation Using Diffusion Models
Preliminaries
Cross-Attention in Diffusion Models
New Evaluation Framework
Difficult Adversarial Attributes (DAA-200)
Edge Prediction Vision Transformer (EPViT)
Visual Genome Training Details
Using EPViT as a Prediction Model
Methods to Improve Object-Attribute Binding in T2I Generation
Focused Cross-Attention (FCA)
Disentangled CLIP Encoding (DisCLIP)
Experimental Set-up
...and 20 more sections

Figures (9)

Figure 1: Examples of integrating focused cross-attention (FCA) and disentangled CLIP embeddings (DisCLIP) into Stable Diffusion and Attend-and-Excite, resulting in (a) a decrease of attribute leakage and (b) improved object-attribute binding.
Figure 2: Integration of FCA and DisCLIP in a diffusion-based T2I model is straightforward. While DisCLIP encodes the input prompt, the cross-attention of the diffusion model is easily replaced by its FCA variant.
Figure 3: An example of (a) a constituency tree and (b) an abstracted constituency tree (removed words in red).
Figure 4: Qualitative results showing that FCA- and DisCLIP-enhanced models improve attribute binding and decrease attribute leakage in images from (a) DAA-200, (b) CC-500 and (c) COCO-10K.
Figure 5: (a) The classification accuracy in % on the ground truth images of DAA-200. (b) The influence of different threshold values $s$ on the EPViT accuracy (in %) on CC-500.
...and 4 more figures
