Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control

Maria Mihaela Trusca, Wolf Nuyts, Jonathan Thomm, Robert Honig, Thomas Hofmann, Tinne Tuytelaars, Marie-Francine Moens

TL;DR

Text-to-image (T2I) diffusion models often misbind attributes to objects in prompts with multiple entities. The authors introduce EPViT, a fine-grained image-text alignment predictor, and two training-free conditioning methods, FCA and DisCLIP, that improve object-attribute binding without retraining the underlying models. They also present DAA-200, a challenging evaluation benchmark, and show that EPViT outperforms CLIP at assessing binding, while FCA and DisCLIP yield consistent gains across several baselines and reduce attribute leakage. The results generalize to COCO-10K prompts and correlate with human judgments, suggesting practical impact for more reliable and controllable T2I generation.

Abstract

Current diffusion models create photorealistic images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image. This is evidenced by our novel image-graph alignment model called EPViT (Edge Prediction Vision Transformer) for the evaluation of image-text alignment. To alleviate the above problem, we propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence. Additionally, the syntax structure of the prompt helps to disentangle the multimodal CLIP embeddings that are commonly used in T2I generation. The resulting DisCLIP embeddings and FCA are easily integrated in state-of-the-art diffusion models without additional training of these models. We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets. (Code and data will be made available upon acceptance.)

Paper Structure

This paper contains 35 sections, 6 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: Examples of integrating focused cross-attention (FCA) and disentangled CLIP embeddings (DisCLIP) into Stable Diffusion and Attend-and-Excite, resulting in (a) a decrease in attribute leakage and (b) improved object-attribute binding.
  • Figure 2: Integration of FCA and DisCLIP in a diffusion-based T2I model is straightforward. While DisCLIP encodes the input prompt, the cross-attention of the diffusion model is easily replaced by its FCA variant.
  • Figure 3: An example of (a) a constituency tree and (b) an abstracted constituency tree (removed words in red).
  • Figure 4: Qualitative results that show that the FCA and DisCLIP enhanced models improve attribute binding and decrease attribute leakage in images from (a) DAA-200, (b) CC-500 and (c) COCO-10K.
  • Figure 5: (a) The classification accuracy in % on the ground truth images of DAA-200. (b) The influence of different threshold values $s$ on the EPViT accuracy (in %) on CC-500.
  • ...and 4 more figures