Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps

Jeeyung Kim, Erfan Esmaeili, Qiang Qiu

TL;DR

This work proposes a method that directly transfers syntactic relations from the text attention maps to the cross-attention module via a test-time optimization to enhance image-text semantic alignment across diverse prompts, without relying on external guidance.

Abstract

In text-to-image diffusion models, the cross-attention map of each text token indicates the specific image regions attended. Comparing these maps of syntactically related tokens provides insights into how well the generated image reflects the text prompt. For example, in the prompt "a black car and a white clock", the cross-attention maps for "black" and "car" should focus on overlapping regions to depict a black car, while those for "car" and "clock" should not. Incorrect overlap in the maps generally produces generation flaws such as missing objects and incorrect attribute binding. Investigating this issue in existing text-to-image models, our study makes two key observations: (1) similarity between the text embeddings of different tokens (used as conditioning inputs) can cause their cross-attention maps to focus on the same image regions; and (2) text embeddings often fail to faithfully capture syntactic relations already present within text attention maps. As a result, such syntactic relationships can be overlooked in the cross-attention module, leading to inaccurate image generation. To address this, we propose a method that directly transfers syntactic relations from the text attention maps to the cross-attention module via a test-time optimization. Our approach leverages this inherent yet unexploited information within text attention maps to enhance image-text semantic alignment across diverse prompts, without relying on external guidance.
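The overlap criterion in the abstract can be illustrated with a small sketch. The 4x4 maps below are hypothetical hand-drawn stand-ins, not maps extracted from any model, and `cosine_similarity_matrix` is an illustrative helper name rather than anything from the paper's code. The sketch only shows how the cosine similarity between per-token cross-attention maps separates a bound pair ("black", "car") from an unbound pair ("car", "clock").

```python
import numpy as np

def cosine_similarity_matrix(attn):
    """Pairwise cosine similarity between token columns.

    attn: array of shape (num_pixels, num_tokens), one flattened
    cross-attention map per token (hypothetical toy input).
    """
    cols = attn / np.linalg.norm(attn, axis=0, keepdims=True)
    return cols.T @ cols

# Hypothetical 4x4 attention maps for three tokens (assumption: toy data).
grid = np.zeros((3, 4, 4))
grid[0, :2, :2] = 1.0  # "black": top-left region
grid[1, :2, :3] = 1.0  # "car": top-left region, slightly wider
grid[2, 2:, 2:] = 1.0  # "clock": bottom-right region

tokens = ["black", "car", "clock"]
# Flatten to (16, 3); the small floor avoids exactly-zero entries.
attn = grid.reshape(3, -1).T + 1e-3

S = cosine_similarity_matrix(attn)
for i in range(len(tokens)):
    for j in range(i + 1, len(tokens)):
        print(f"{tokens[i]:>5} vs {tokens[j]:<5}: {S[i, j]:.3f}")
```

In this toy setting the bound pair scores high and the unbound pairs score near zero, which is the pattern the abstract associates with correct attribute binding and object presence.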

Paper Structure

This paper contains 21 sections, 5 theorems, 47 equations, 11 figures, 3 tables.

Key Result

Proposition 1

If $A^{(\ell,h)}\in\mathbb{R}^{N_\text{c}\times s}$ is a cross-attention map as defined in the paper's cross-attention equation, then under assumptions (i), (ii), and (iii) described in Appendix A, the cosine similarity matrix can be written in terms of key vectors $\mathbf{k}_i^{(\ell,h)}$ [...] up to terms of at least $\mathcal{O}(1/\sqrt{N_\text{c}})$ and $\mathcal{O}(\epsilon)$ [...]

Figures (11)

  • Figure 1: Overview of our method. We leverage the text self-attention matrix and optimize the latent noise ($z_t$) by minimizing the distance between the cross-attention similarity matrix ($\mathsf{S}$) and the text self-attention matrix ($\mathsf{T}$). This encourages integrating syntactic relationships into text-to-image diffusion models.
  • Figure 2: For the analysis, we use the prompt sets from Chefer et al. (2023), structured as "[$\text{attribute}_1$] [$\text{object}_1$] and [$\text{attribute}_2$] [$\text{object}_2$]", "[$\text{object}_1$] and/with [$\text{object}_2$]" or "[$\text{object}_1$] and [$\text{attribute}_2$] [$\text{object}_2$]". (a) Comparison of the cosine similarity of text embeddings with that of the corresponding cross-attention maps at denoising step 1, for pairs of tokens ($\text{object}_i$, $\text{object}_j$), where $i \neq j$, and pairs of tokens ($\text{attribute}_m, \text{object}_n$) for both $m = n$ and $m \neq n$. As text embeddings become more similar, their cross-attention maps become more similar. (b) The distributions of text embedding similarity between i) bound tokens: ($\text{attribute}_i, \text{object}_i$) for $i=1,2$, and ii) unbound tokens: ($\text{object}_1, \text{object}_2$). The distributions show no discernible difference, indicating that text embeddings do not effectively represent the syntactic relationships. (c) Comparison of text embedding similarity (left) and the text self-attention map raised to the third power (right) for the prompt "a black car and a white clock". In the self-attention maps ($\mathsf{T}$), "clock" attends more to "white", unlike in the text embeddings.
  • Figure 3: The generated images and cross-attention maps ($\mathsf{A}$) for specific tokens from SD v1.5. These illustrate the importance of spatial alignment in cross-attention maps for accurate image generation. Divergent cross-attention maps for syntactically unbound words, and overlapping maps for bound words, enhance text-to-image fidelity.
  • Figure 4: Comparison of the distributions of cosine similarity between cross-attention maps (at denoising step 10). (a) Cases with one missing object (incorrect) and with both objects present (correct). (b) Cases with incorrect and correct attribute binding. Correct instances are more frequent when the cosine similarity is low for object presence and high for attribute binding.
  • Figure 5: Correlation between the cosine similarity of text embeddings and that of cross-attention maps across denoising steps ($t=1, 21, 50$). Similar text embeddings generally lead to similar cross-attention maps, with the correlation weakening over time.
  • ...and 6 more figures
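
The optimization described in the Figure 1 caption can be sketched on a toy problem. Everything below is an assumption made for illustration: `cross_attention` is a stand-in softmax layer rather than a diffusion U-Net, the target matrix `T` holds hypothetical values rather than a real text self-attention matrix, the distance is taken to be a mean squared difference, and finite differences replace backpropagation. The sketch only shows the shape of the idea: update a latent `z` so that the cosine similarity matrix `S` of its cross-attention maps moves toward `T`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z, keys):
    """Toy cross-attention: each pixel (row of z) attends over tokens."""
    return softmax(z @ keys.T / np.sqrt(z.shape[1]), axis=1)  # (pixels, tokens)

def similarity(attn):
    """Cosine similarity between the per-token attention maps (columns)."""
    cols = attn / np.linalg.norm(attn, axis=0, keepdims=True)
    return cols.T @ cols

def loss(z, keys, T):
    """Assumed distance: mean squared difference between S and T."""
    S = similarity(cross_attention(z, keys))
    return np.mean((S - T) ** 2)

def numerical_grad(f, z, eps=1e-5):
    """Central finite differences; stands in for backpropagation."""
    g = np.zeros_like(z)
    for idx in np.ndindex(*z.shape):
        step = np.zeros_like(z)
        step[idx] = eps
        g[idx] = (f(z + step) - f(z - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
keys = rng.normal(size=(3, 8))   # hypothetical keys for 3 tokens
z = rng.normal(size=(16, 8))     # hypothetical latent with 16 "pixels"
# Hypothetical target: tokens 0 and 1 are bound, token 2 is unbound.
T = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

f = lambda latent: loss(latent, keys, T)
initial_loss = f(z)
for _ in range(20):
    g = numerical_grad(f, z)
    step = 1.0
    # Backtracking: shrink the step until the loss actually decreases.
    while step > 1e-6 and f(z - step * g) >= f(z):
        step *= 0.5
    if f(z - step * g) < f(z):
        z = z - step * g
final_loss = f(z)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

The backtracking check keeps the toy loop monotone, so the loss never increases between iterations; a real implementation would presumably rely on automatic differentiation through the model instead.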

Theorems & Definitions (7)

  • Proposition 1
  • Proposition 2
  • Proposition 1
  • proof
  • Lemma 1
  • Proposition 2
  • proof