Improving Compositional Attribute Binding in Text-to-Image Generative Models via Enhanced Text Embeddings

Arman Zarei, Keivan Rezaei, Samyadeep Basu, Mehrdad Saberi, Mazda Moayeri, Priyatham Kattakinda, Soheil Feizi

TL;DR

This work analyzes why text-to-image diffusion models struggle with compositional prompts, identifying erroneous CLIP attention contributions and sub-optimal CLIP text embeddings as key causes. It introduces WiCLP, a window-based linear projection of CLIP outputs (and a token-wise variant, CLP), to align the text-embedding space with a more compositional representation, augmented by a Switch-Off strategy that limits projection use during inference. Across multiple Stable Diffusion (SD) variants and other models, WiCLP significantly improves compositional attribute binding (color, texture, shape) as measured by VQA/TIFA, with competitive FID on clean prompts and lower parameter and compute costs than full CLIP fine-tuning. The results suggest that a lightweight, trainable projection layer can substantially enhance compositional generation without sacrificing overall image quality, offering a practical path to more faithful scene composition in diffusion models. Limitations remain in modeling complex spatial relations and numeracy, pointing to future work on improving CLIP's compositional understanding and extending the approach to broader encoders and prompts.

Abstract

Text-to-image diffusion-based generative models have the stunning ability to generate photo-realistic images and achieve state-of-the-art low FID scores on challenging image generation benchmarks. However, one of the primary failure modes of these models is composing attributes, objects, and their associated relationships accurately into an image. In this paper, we investigate compositional attribute binding failures, where the model fails to correctly associate descriptive attributes (such as color, shape, or texture) with the corresponding objects in the generated images, and highlight that imperfect text conditioning with the CLIP text encoder is one of the primary reasons these models cannot generate high-fidelity compositional scenes. In particular, we show that (i) there exists an optimal text-embedding space that can generate highly coherent compositional scenes, showing that the output space of the CLIP text encoder is sub-optimal, and (ii) the final token embeddings in CLIP are erroneous, as they often include attention contributions from unrelated tokens in compositional prompts. Our main finding is that significant compositional improvements can be achieved (without harming the model's FID score) by fine-tuning only a simple and parameter-efficient linear projection on CLIP's representation space in Stable Diffusion variants, using a small set of compositional image-text pairs.
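
The linear projection on CLIP's representation space can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the projection is added residually to each token embedding, that each token's window is the concatenation of its neighbors' embeddings, and that positions outside the sequence are zero-padded. None of these details is stated in this summary. The names `window_projection` and `half_window` are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
from typing import List

Vector = List[float]


def window_projection(
    tokens: List[Vector],
    weight: List[Vector],
    bias: Vector,
    half_window: int,
) -> List[Vector]:
    """Project each token's surrounding window and add it to the token.

    tokens:      n token embeddings, each of dimension d
    weight:      d rows, each of length (2 * half_window + 1) * d
    bias:        d values
    half_window: number of neighbors taken on each side (0 = token-wise)

    The residual form and the zero padding are assumptions of this sketch.
    """
    n, d = len(tokens), len(tokens[0])
    expected_cols = (2 * half_window + 1) * d
    if len(weight) != d or any(len(row) != expected_cols for row in weight):
        raise ValueError("weight must have shape (d, (2*half_window+1)*d)")

    zero = [0.0] * d
    projected = []
    for i in range(n):
        # Concatenate the embeddings in the window around token i,
        # padding with zeros where the window leaves the sequence.
        window: Vector = []
        for j in range(i - half_window, i + half_window + 1):
            window.extend(tokens[j] if 0 <= j < n else zero)

        # Linear map of the window, one output value per embedding dimension.
        delta = [
            sum(w * x for w, x in zip(row, window)) + b
            for row, b in zip(weight, bias)
        ]
        # Residual connection: original embedding plus the projected window.
        projected.append([c + dl for c, dl in zip(tokens[i], delta)])
    return projected


# Toy usage: three 1-dimensional "token embeddings", window of one neighbor per side.
toy_tokens = [[1.0], [2.0], [3.0]]
toy_out = window_projection(toy_tokens, [[1.0, 1.0, 1.0]], [0.0], half_window=1)
print(toy_out)  # [[4.0], [8.0], [8.0]]
```

Under these assumptions, setting `half_window=0` reduces the sketch to a token-wise projection, and an all-zero `weight` and `bias` leave the original embeddings unchanged.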

Paper Structure

This paper contains 32 sections, 10 equations, 19 figures, and 3 tables.

Figures (19)

  • Figure 1: Overview of our analysis and proposed methods. The figure identifies two sources of error behind Stable Diffusion's failure to generate images that match compositional prompts: (i) erroneous attention contributions in CLIP (minor) and (ii) sub-optimal CLIP text embeddings (major). We propose a window-based linear projection (WiCLP), which applies a linear projection to a token’s surrounding window to enhance its embedding.
  • Figure 2: Qualitative comparison of baselines and our projection method (WiCLP). Incorporating WiCLP significantly improves image alignment with the prompts.
  • Figure 3: The heatmap illustrates unintended attention contributions in CLIP, while highlighting the more accurate performance of T5.
  • Figure 5: Sub-optimality of CLIP Text-Encoder for Compositional Prompts. Optimizing a learnable vector to represent an improved text embedding, while keeping the UNet frozen, enables the generation of more compositionally accurate images.
  • Figure 6: Qualitative results showing the impact of Switch-Off with varying thresholds $\tau$.
  • ...and 14 more figures
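
The Switch-Off threshold $\tau$ mentioned in the Figure 6 caption can be illustrated with a small scheduling sketch. It assumes that the projected text embedding conditions the early, high-noise denoising steps (timestep above $\tau$) and that the original CLIP embedding is used afterwards. The direction of this rule is an assumption and is not spelled out in the captions above. The function name `select_conditioning` and the placeholder string embeddings are hypothetical.

```python
from typing import List, Sequence, TypeVar

E = TypeVar("E")


def select_conditioning(
    timesteps: Sequence[int],
    tau: int,
    projected: E,
    original: E,
) -> List[E]:
    """Pick the text conditioning used at each denoising step.

    Assumption of this sketch: the projection is active while the
    timestep is strictly above tau and switched off from tau downward.
    """
    return [projected if t > tau else original for t in timesteps]


# Toy usage: five denoising steps running from high noise to low noise.
steps = list(range(1000, 0, -200))  # 1000, 800, 600, 400, 200
schedule = select_conditioning(steps, tau=500, projected="projected", original="original")
for t, cond in zip(steps, schedule):
    print(t, cond)
```

Under this assumed rule, `tau=0` keeps the projection on for every step in the toy schedule, while a `tau` at or above the largest timestep disables it entirely.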