Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control
Maria Mihaela Trusca, Wolf Nuyts, Jonathan Thomm, Robert Honig, Thomas Hofmann, Tinne Tuytelaars, Marie-Francine Moens
TL;DR
Text-to-image (T2I) diffusion models often bind attributes to the wrong objects when a prompt mentions multiple entities. The authors introduce EPViT, a fine-grained image-text alignment predictor, and two training-free conditioning methods, FCA and DisCLIP, that improve object-attribute binding without retraining the underlying models. They also present DAA-200, a challenging evaluation benchmark, and show that EPViT assesses binding more accurately than CLIP, while FCA and DisCLIP yield consistent gains across several baselines and reduce attribute leakage. The results generalize to COCO-10K prompts and correlate with human judgments, suggesting practical impact for more reliable and controllable T2I generation.
Abstract
Current diffusion models create photorealistic images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image. This is evidenced by our novel image-graph alignment model called EPViT (Edge Prediction Vision Transformer) for the evaluation of image-text alignment. To alleviate the above problem, we propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence. Additionally, the syntax structure of the prompt helps to disentangle the multimodal CLIP embeddings that are commonly used in T2I generation. The resulting DisCLIP embeddings and FCA are easily integrated in state-of-the-art diffusion models without additional training of these models. We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets. Code and data will be made available upon acceptance.
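
To make the idea of syntax-constrained attention concrete, the following is a minimal sketch of one plausible reading of focused cross-attention. It is not the authors' implementation. The function name `focused_cross_attention`, the `bindings` dictionary (attribute token index mapped to the index of the object token it modifies, as a dependency parse might provide), and the rule that each pixel is assigned to the object with the highest attention are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def focused_cross_attention(logits, bindings):
    """Restrict each attribute token to the pixels of its own object.

    logits   : list over pixels, each a list of raw scores over prompt tokens.
    bindings : hypothetical dict {attribute_token_index: object_token_index}.
    Returns attention weights with the same shape, each row summing to 1.
    """
    attn = [softmax(row) for row in logits]
    if not bindings:
        return attn  # no syntactic constraints: plain cross-attention

    objects = sorted(set(bindings.values()))
    focused = []
    for row in attn:
        # Assumed focus rule: the pixel belongs to the object token
        # that receives the most attention at this pixel.
        owner = max(objects, key=lambda o: row[o])
        new_row = list(row)
        for attr, obj in bindings.items():
            if obj != owner:
                new_row[attr] = 0.0  # attribute may not leak onto another object
        total = sum(new_row)  # > 0 because object tokens are never masked
        focused.append([w / total for w in new_row])
    return focused


# Prompt tokens (illustrative): 0="red", 1="car", 2="blue", 3="bench"
bindings = {0: 1, 2: 3}
logits = [
    [1.0, 2.0, 1.5, 0.5],  # pixel where "car" dominates
    [1.0, 0.2, 0.3, 2.5],  # pixel where "bench" dominates
]
weights = focused_cross_attention(logits, bindings)
```

In this toy example, "blue" receives zero weight at the pixel assigned to "car", and "red" receives zero weight at the pixel assigned to "bench". In a real diffusion model the same masking would be applied to the cross-attention maps at inference time, which is consistent with the abstract's claim that no additional training is needed.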
