Improving Compositional Attribute Binding in Text-to-Image Generative Models via Enhanced Text Embeddings
Arman Zarei, Keivan Rezaei, Samyadeep Basu, Mehrdad Saberi, Mazda Moayeri, Priyatham Kattakinda, Soheil Feizi
TL;DR
This work analyzes why text-to-image diffusion models struggle with compositional prompts and identifies two key causes: erroneous attention contributions in CLIP and sub-optimal CLIP text embeddings. It introduces WiCLP, a window-based linear projection of CLIP outputs (along with a token-wise variant, CLP), which aligns the text-embedding space with a more compositional representation, and augments it with a Switch-Off strategy that limits use of the projection during inference. Across multiple Stable Diffusion (SD) variants and other models, WiCLP significantly improves compositional attribute binding (color, texture, shape) as measured by VQA/TIFA, while keeping FID on clean prompts competitive and requiring fewer parameters and less compute than full CLIP fine-tuning. The results suggest that a lightweight, trainable projection layer can substantially enhance compositional generation without sacrificing overall image quality, offering a practical path to more faithful scene composition in diffusion models. Limitations remain in modeling complex spatial relations and numeracy, pointing to future work on improving CLIP's compositional understanding and extending the approach to broader encoders and prompts.
Abstract
Text-to-image diffusion-based generative models have the stunning ability to generate photo-realistic images and achieve state-of-the-art low FID scores on challenging image generation benchmarks. However, one of the primary failure modes of these text-to-image generative models is in composing attributes, objects, and their associated relationships accurately into an image. In this paper, we investigate compositional attribute binding failures, where the model fails to correctly associate descriptive attributes (such as color, shape, or texture) with the corresponding objects in the generated images, and highlight that imperfect text conditioning with the CLIP text encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes. In particular, we show that (i) there exists an optimal text-embedding space that can generate highly coherent compositional scenes, which shows that the output space of the CLIP text encoder is sub-optimal, and (ii) the final token embeddings in CLIP are erroneous, as they often include attention contributions from unrelated tokens in compositional prompts. Our main finding is that significant compositional improvements can be achieved (without harming the model's FID score) by fine-tuning only a simple and parameter-efficient linear projection on CLIP's representation space in Stable Diffusion variants, using a small set of compositional image-text pairs.
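To make the idea of a lightweight linear projection over CLIP's output token embeddings concrete, the following is a minimal pure-Python sketch, not the authors' implementation. The function name `window_linear_projection`, the `half_window` parameter, the zero padding at sequence edges, and the optional residual connection are all assumptions made for illustration; the text here does not specify them.

```python
def window_linear_projection(tokens, weight, bias, half_window=1, residual=True):
    """Project each token embedding using a window of neighboring embeddings.

    Illustrative sketch only (all names and design choices are assumptions).

    tokens: list of n embeddings, each a list of d floats
    weight: matrix with (2 * half_window + 1) * d rows and d columns
    bias:   list of d floats
    """
    n, d = len(tokens), len(tokens[0])
    zero = [0.0] * d
    out = []
    for i in range(n):
        # Concatenate the embeddings in the window around position i,
        # zero-padding positions that fall outside the sequence (assumption).
        window = []
        for j in range(i - half_window, i + half_window + 1):
            window.extend(tokens[j] if 0 <= j < n else zero)
        # Linear map from the concatenated window back to d dimensions.
        projected = [
            sum(window[k] * weight[k][c] for k in range(len(window))) + bias[c]
            for c in range(d)
        ]
        # Optionally add the original embedding back (assumed residual form).
        if residual:
            projected = [p + t for p, t in zip(projected, tokens[i])]
        out.append(projected)
    return out
```

With `half_window=0` the window contains only the token itself, which corresponds to a purely token-wise projection. In this sketch, with the residual form, an all-zero `weight` and `bias` leave the embeddings unchanged, so only the projection parameters would need to be trained while the text encoder stays frozen.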
