Cross-Attention Head Position Patterns Can Align with Human Visual Concepts in Text-to-Image Generative Models
Jungwon Park, Jungmin Ko, Dongnam Byun, Jangwon Suh, Wonjong Rhee
TL;DR
This work investigates the interpretability of cross-attention in text-to-image diffusion models by introducing Head Relevance Vectors (HRVs), which assign each cross-attention head an importance score for a human-specified visual concept. An ordered weakening analysis shows that HRVs reflect concept-aligned head activations, and two methods built on them, concept strengthening and concept adjusting, enable fine-grained, head-level control. The proposed methods improve three visual-generation tasks: reducing misinterpretations of polysemous words in image generation, improving edits of five challenging attributes in image editing, and mitigating catastrophic neglect in multi-concept generation, with consistent results in both Stable Diffusion v1.4 and SDXL. These findings advance the mechanistic understanding of cross-attention layers in diffusion models and provide practical, plug-in strategies for head-level steering without model retraining. The work also discusses extensions, limitations, reproducibility, and ethical considerations for deploying such controllable generative systems.
Abstract
Recent text-to-image diffusion models leverage cross-attention layers, which have been effectively utilized to enhance a range of visual generative tasks. However, our understanding of cross-attention layers remains limited. In this study, we introduce a mechanistic interpretability approach for diffusion models by constructing Head Relevance Vectors (HRVs) that align with human-specified visual concepts. An HRV for a given visual concept has a length equal to the total number of cross-attention heads, with each element representing the importance of the corresponding head for that concept. To validate HRVs as interpretable features, we develop an ordered weakening analysis that demonstrates their effectiveness. Furthermore, we propose concept strengthening and concept adjusting methods and apply them to enhance three visual generative tasks. Our results show that HRVs can reduce misinterpretations of polysemous words in image generation, successfully modify five challenging attributes in image editing, and mitigate catastrophic neglect in multi-concept generation. Overall, our work advances the understanding of cross-attention layers and introduces new approaches for fine-grained control of these layers at the head level.
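
To make the HRV data structure concrete, the following is a minimal sketch, not the authors' implementation. It assumes an HRV is stored as a one-dimensional array with one relevance score per cross-attention head, and that weakening a head means rescaling its cross-attention map by a constant factor. The function names, the toy head count, and the weakening factor of 0.0 are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np


def ordered_weakening_scales(hrv, num_weakened, weak_scale=0.0):
    """Build per-head scale factors for an ordered weakening step.

    Heads are ranked from most to least relevant according to the HRV,
    and the `num_weakened` top-ranked heads receive `weak_scale`
    (a hypothetical factor); all other heads keep a scale of 1.0.
    """
    hrv = np.asarray(hrv, dtype=float)
    # Stable sort on the negated scores: most relevant head first.
    order = np.argsort(-hrv, kind="stable")
    scales = np.ones_like(hrv)
    scales[order[:num_weakened]] = weak_scale
    return scales


def rescale_heads(attn_maps, scales):
    """Scale each head's cross-attention map by its own factor.

    attn_maps is assumed to have shape (num_heads, num_pixels, num_tokens).
    """
    scales = np.asarray(scales, dtype=float)
    return attn_maps * scales[:, None, None]


# Toy example with 4 heads (illustrative only; real models have many more).
hrv = np.array([0.1, 0.9, 0.3, 0.7])      # one relevance score per head
scales = ordered_weakening_scales(hrv, num_weakened=2)
attn = np.ones((4, 2, 3))                 # placeholder attention maps
weakened = rescale_heads(attn, scales)
```

In this toy setup, the two heads with the highest relevance (indices 1 and 3) are zeroed out while the others pass through unchanged. Sweeping `num_weakened` from 0 to the total number of heads, in either most-relevant-first or least-relevant-first order, is one plausible way to picture how an ordered weakening analysis could probe whether the scores track a concept.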
