
Dissecting Query-Key Interaction in Vision Transformers

Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz

TL;DR

This work addresses how vision transformers process context and salience via self-attention by analyzing the query-key interaction through a singular value decomposition of $\mathbf{W}_q^\top\mathbf{W}_k$. The authors show that early layers tend to group similar tokens, while deeper layers increasingly contextualize using dissimilar tokens, with many singular modes offering semantically meaningful interpretations of token interactions. By decomposing attention into left-right singular vector pairs and associated singular values, they provide a principled, interpretable framework for understanding how attention combines information across objects, parts, and background. The approach enhances explainability of ViTs and offers a pathway to extend singular-mode analysis to other modalities such as language transformers.

Abstract

Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to dissimilar tokens can be beneficial by providing contextual information. We propose to analyze the query-key interaction by the singular value decomposition of the interaction matrix (i.e. ${\textbf{W}_q}^\top\textbf{W}_k$). We find that in many ViTs, especially those with classification training objectives, early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens, providing evidence corresponding to perceptual grouping and contextualization, respectively. Many of these interactions between features represented by singular vectors are interpretable and semantic, such as attention between relevant objects, between parts of an object, or between the foreground and background. This offers a novel perspective on interpreting the attention mechanism, which contributes to understanding how transformer models utilize context and salient features when processing images.
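The decomposition described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' code: it assumes a single head with weights of shape `(d_head, d_model)`, ignores biases and the usual $1/\sqrt{d}$ scaling, and uses random matrices in place of trained weights. The function names `singular_modes` and `attention_logits_by_mode` are hypothetical. Under these assumptions, the attention logit between tokens $i$ and $j$ is $x_i^\top \mathbf{W}_q^\top \mathbf{W}_k x_j = \sum_n s_n (x_i^\top u_n)(v_n^\top x_j)$, so each singular mode contributes one additive term.

```python
import numpy as np

def singular_modes(W_q, W_k):
    """SVD of the query-key interaction matrix W_q^T W_k for one head.

    W_q, W_k: arrays of shape (d_head, d_model) (an assumed layout).
    Returns U (d_model, r), S (r,), Vt (r, d_model) with r = d_head,
    since the interaction matrix has rank at most d_head.
    """
    interaction = W_q.T @ W_k                 # (d_model, d_model)
    U, S, Vt = np.linalg.svd(interaction)
    r = W_q.shape[0]
    return U[:, :r], S[:r], Vt[:r, :]

def attention_logits_by_mode(X, U, S, Vt):
    """Per-mode logits: out[n, i, j] = S[n] * (x_i . u_n) * (v_n . x_j)."""
    query_map = X @ U       # projection onto left singular vectors (query side)
    key_map = X @ Vt.T      # projection onto right singular vectors (key side)
    return np.einsum("n,in,jn->nij", S, query_map, key_map)

# Toy example with random weights and token embeddings (illustrative only).
rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 16, 4, 5
W_q = rng.normal(size=(d_head, d_model))
W_k = rng.normal(size=(d_head, d_model))
X = rng.normal(size=(n_tokens, d_model))

U, S, Vt = singular_modes(W_q, W_k)
per_mode = attention_logits_by_mode(X, U, S, Vt)

# Summing over modes recovers the ordinary (unscaled, bias-free) logits.
direct = (X @ W_q.T) @ (X @ W_k.T).T
assert np.allclose(per_mode.sum(axis=0), direct)
```

Ranking the slices of `per_mode` by magnitude for a given token pair shows which modes contribute most to that pair's attention logit.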

Paper Structure (13 sections, 7 equations, 25 figures)

Figures (25)

  • Figure 1: We propose a new way to study query-key interactions via the singular value decomposition of the query-key interaction matrix. Many of the modes (i.e., pairs of singular vectors corresponding to the query and the key, respectively) are semantically interpretable. Two example modes are shown. Top row: ViT layer 8 head 7 mode 2. Bottom row: DINO layer 8 head 9 mode 2. The red channel indicates the projection of the embedding onto the left singular vector, which corresponds to the query; the cyan channel indicates the projection of the embedding onto the right singular vector, which corresponds to the key.
  • Figure 2: Attention preference in the Odd-One-Out (O3) dataset (Kotseruba et al., 2019). A. An example from the O3 dataset. Two tokens are chosen to correspond to the target and the distractor in the image. Attention maps are computed using these two tokens as queries. We examine the overlap between the attention map of the target and each of the target, distractor, and background masks. Similarly, we examine the overlap between the attention map of the distractor and each of the distractor, target, and background masks. B. Ratio of attention on the same objects (target-target and distractor-distractor attention). The x-axis is the normalized layer number, from early layers (left) to late layers (right). C. Ratio of attention on different objects (target-distractor and distractor-target attention). D. Ratio of attention on the background (target-background and distractor-background attention).
  • Figure 3: Cosine similarity between left and right singular vectors. The cosine similarity is computed per head and singular mode. The weighted average of the cosine similarity uses the corresponding singular values as weights.
  • Figure 4: Examples of optimal attention images of singular modes, with their query and key maps, in dino-vitb16. Optimal attention images are the images from the ImageNet validation set that induce the largest attention score (sorted by the product of the maximum of the query map and the maximum of the key map). The red and cyan channels are the projections of the embedding onto the left and right singular vectors of a singular mode; they correspond to the query and the key, respectively. The white area is where the query map and key map overlap. The name code we assign to singular modes specifies the layer, head, and mode numbers. For example, "L1 H4 M3" means layer 1, head 4, and mode 3. The value below each example indicates the cosine similarity between the left and right singular vectors.
  • Figure 5: Visualization of a single image with multiple modes. We pick an example dog image from the ImageNet dataset and use the dino-vitb16 model. The top 6 modes (ordered by their contribution to the attention score) for example layers and heads are shown. See Supplementary Figure \ref{SFig:16} for extended mode visualizations of this image.
  • ...and 20 more figures
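
The quantities named in the captions of Figures 3 and 4 can be sketched as follows. This is an illustrative reading of the captions, not the authors' implementation. It assumes per-head weights of shape `(d_head, d_model)`, a singular-value-weighted average taken within one head, and patch embeddings with the class token already removed. The helpers `mode_cosine_similarity` and `query_key_maps` are hypothetical names. Because singular vectors have unit norm, the cosine similarity of a mode is just the dot product of its left and right singular vectors, and it is unaffected by the joint sign flip that an SVD permits.

```python
import numpy as np

def mode_cosine_similarity(W_q, W_k):
    """Cosine similarity between left/right singular vectors of W_q^T W_k.

    Returns (cos, S, weighted_mean), where weighted_mean weights each
    mode's cosine similarity by its singular value (assumed per head).
    """
    U, S, Vt = np.linalg.svd(W_q.T @ W_k)
    r = W_q.shape[0]                       # rank is at most d_head
    U, S, Vt = U[:, :r], S[:r], Vt[:r, :]
    cos = np.sum(U * Vt.T, axis=0)         # u_n . v_n for each mode n
    weighted_mean = float(np.sum(S * cos) / np.sum(S))
    return cos, S, weighted_mean

def query_key_maps(patch_embeddings, u, v, grid_hw):
    """Project patch embeddings onto one mode and reshape to the patch grid.

    patch_embeddings: (h * w, d_model), class token excluded (assumption).
    """
    h, w = grid_hw
    query_map = (patch_embeddings @ u).reshape(h, w)   # left vector: query
    key_map = (patch_embeddings @ v).reshape(h, w)     # right vector: key
    return query_map, key_map

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))

# Identical query and key weights: every mode pairs a feature with itself.
cos_same, _, avg_same = mode_cosine_similarity(W, W)
# Sign-flipped key weights: every mode pairs a feature with its negation.
cos_opp, _, avg_opp = mode_cosine_similarity(W, -W)

# Toy 3x3 patch grid with random embeddings, mode 0 of a random head.
W_k = rng.normal(size=(4, 16))
U, S, Vt = np.linalg.svd(W.T @ W_k)
patches = rng.normal(size=(9, 16))
q_map, k_map = query_key_maps(patches, U[:, 0], Vt[0], (3, 3))
```

In the two extreme cases, the cosine similarities are close to +1 and -1 respectively, matching the intuition that a positive value indicates attention to similar features and a negative value indicates attention to dissimilar ones.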