Cross-Attention Head Position Patterns Can Align with Human Visual Concepts in Text-to-Image Generative Models

Jungwon Park, Jungmin Ko, Dongnam Byun, Jangwon Suh, Wonjong Rhee

TL;DR

This work investigates the interpretability of cross-attention (CA) layers in text-to-image diffusion models by introducing Head Relevance Vectors (HRVs), which assign a per-head importance score to each human-specified visual concept. An ordered weakening analysis shows that HRVs reflect concept-aligned head activations, and two steering methods, concept strengthening and concept adjusting, use them for fine-grained, head-level control. The proposed methods improve three visual generative tasks: reducing misinterpretations of polysemous words in image generation, modifying five challenging attributes in image editing, and mitigating catastrophic neglect in multi-concept generation, with consistent results on both Stable Diffusion v1.4 and SDXL. These findings advance the mechanistic understanding of CA layers in diffusion models and provide practical, plug-in strategies for head-level steering without model retraining. The work also discusses extensions, limitations, reproducibility, and ethical considerations for deploying such controllable generative systems.

Abstract

Recent text-to-image diffusion models leverage cross-attention layers, which have been effectively utilized to enhance a range of visual generative tasks. However, our understanding of cross-attention layers remains somewhat limited. In this study, we introduce a mechanistic interpretability approach for diffusion models by constructing Head Relevance Vectors (HRVs) that align with human-specified visual concepts. An HRV for a given visual concept has a length equal to the total number of cross-attention heads, with each element representing the importance of the corresponding head for the given visual concept. To validate HRVs as interpretable features, we develop an ordered weakening analysis that demonstrates their effectiveness. Furthermore, we propose concept strengthening and concept adjusting methods and apply them to enhance three visual generative tasks. Our results show that HRVs can reduce misinterpretations of polysemous words in image generation, successfully modify five challenging attributes in image editing, and mitigate catastrophic neglect in multi-concept generation. Overall, our work provides an advancement in understanding cross-attention layers and introduces new approaches for fine-controlling these layers at the head level.

Paper Structure

This paper contains 59 sections, 4 equations, 49 figures, 12 tables, 1 algorithm.

Figures (49)

  • Figure 1: We develop a method for constructing head relevance vectors (HRVs) that align with useful visual concepts. For a specified visual concept, an HRV assigns a relevance score to each cross-attention head, revealing its importance for that concept. Our analysis shows that the constructed HRVs can serve as interpretable features. We also demonstrate that HRVs can be effectively integrated to improve three visual generative tasks.
  • Figure 2: Overview of a single HRV update for a cross-attention (CA) head position $h$. While generating a random image, the most relevant visual concept is identified. Then the concept's head relevance vector (HRV) is updated to have an increased value at position $h$. For illustration purposes, we show only 5 visual concepts ($N=5$) and 6 CA heads ($H=6$). In our main experiments, we adopt $N=34$ and $H=128$. This update is repeated over all head positions $h=1, \dots, H$ and all timesteps $t=1, \dots, T$ for a sufficiently large number of random image generations.
  • Figure 3: Ordered weakening analysis for three visual concepts: The visual concept of interest disappears significantly faster with MoRHF, where the most relevant heads in the corresponding HRV are weakened first. Note that 128 corresponds to the weakening of all heads.
  • Figure 4: Two rescaling vectors for visual concept steering. Left: Concept strengthening uses the HRV of a desired visual concept as the rescaling vector. Concept adjusting combines the HRVs of a desired and an undesired visual concept to define the rescaling vector: $2 \cdot \text{(HRV of desired concept)} - 1 \cdot \text{(HRV of undesired concept)}$. Here, $H=128$ denotes the number of CA heads. Right: For both concept steering methods, the $h$-th CA map of a target token is rescaled using $r_h$, the $h$-th element of the rescaling vector, where $h=1,\dots,H$. Here, $L=77$ denotes the token length.
  • Figure 5: Examples of image generations from Stable Diffusion (SD) and SD-HRV (ours) using prompts frequently misinterpreted by T2I models. SD-HRV effectively reduces misinterpretation compared to SD.
  • ...and 44 more figures
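
The two rescaling vectors described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the `(H, P, L)` array layout (with `P` image positions), and the random placeholder HRVs and CA maps are all assumptions made for illustration. The caption does not say at which point in the attention computation the rescaling is applied, so the sketch only scales a given array.

```python
import numpy as np

H = 128  # number of CA heads (from the Figure 4 caption)
L = 77   # token length (from the Figure 4 caption)


def strengthening_vector(hrv_desired):
    """Concept strengthening: the desired concept's HRV is the rescaling vector."""
    return np.asarray(hrv_desired, dtype=float)


def adjusting_vector(hrv_desired, hrv_undesired):
    """Concept adjusting: 2 * (HRV of desired) - 1 * (HRV of undesired)."""
    desired = np.asarray(hrv_desired, dtype=float)
    undesired = np.asarray(hrv_undesired, dtype=float)
    return 2.0 * desired - 1.0 * undesired


def rescale_target_token(ca_maps, r, token_idx):
    """Scale the target token's CA map in head h by r[h], for every head.

    ca_maps has the assumed shape (H, P, L); all other tokens are left untouched.
    """
    out = np.array(ca_maps, dtype=float, copy=True)
    out[:, :, token_idx] *= r[:, None]  # broadcast r over the P image positions
    return out


# Placeholder inputs: random values stand in for real HRVs and CA maps.
rng = np.random.default_rng(0)
hrv_a = rng.random(H)             # hypothetical HRV of the desired concept
hrv_b = rng.random(H)             # hypothetical HRV of the undesired concept
ca_maps = rng.random((H, 16, L))  # P = 16 image positions, chosen arbitrarily

r = adjusting_vector(hrv_a, hrv_b)
steered = rescale_target_token(ca_maps, r, token_idx=2)
```

For concept strengthening, pass `strengthening_vector(hrv_a)` as `r` instead. In both cases only the target token's map changes, which matches the caption's description of rescaling the $h$-th CA map of a target token by $r_h$.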