Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi

TL;DR

This work automates the decomposition of the final representation into contributions from different model components, linearly maps these contributions to CLIP space to interpret them via text, and introduces a novel scoring function to rank components by their importance with respect to specific features.

Abstract

Recent work has explored how individual components of the CLIP-ViT model contribute to the final representation by leveraging the shared image-text representation space of CLIP. These components, such as attention heads and MLPs, have been shown to capture distinct image features like shape, color or texture. However, understanding the role of these components in arbitrary vision transformers (ViTs) is challenging. To this end, we introduce a general framework which can identify the roles of various components in ViTs beyond CLIP. Specifically, we (a) automate the decomposition of the final representation into contributions from different model components, and (b) linearly map these contributions to CLIP space to interpret them via text. Additionally, we introduce a novel scoring function to rank components by their importance with respect to specific features. Applying our framework to various ViT variants (e.g. DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the roles of different components concerning particular image features. These insights facilitate applications such as image retrieval using text descriptions or reference images, visualizing token importance heatmaps, and mitigating spurious correlations. We release our code to reproduce the experiments at https://github.com/SriramB-98/vit-decompose
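
The two steps named in the abstract, decomposition into component contributions and linear mapping into CLIP space, can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration: the contributions and "CLIP" targets are synthetic, the array names are hypothetical, and the maps are fitted by ordinary least squares, which stands in for whatever training procedure the released code uses. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_images, num_components, d_model, d_clip = 200, 4, 8, 6

# Hypothetical per-component contributions c_i for every image,
# shape (num_images, num_components, d_model).
contribs = rng.normal(size=(num_images, num_components, d_model))

# In an additive decomposition the final representation is their sum.
z = contribs.sum(axis=1)

# Synthetic "CLIP" targets built from hidden per-component linear maps
# (a toy assumption so that an exact linear alignment exists).
true_maps = rng.normal(size=(num_components, d_model, d_clip))
clip_targets = np.einsum("ncd,cdk->nk", contribs, true_maps)

# Fit one linear map per component jointly, so that the sum of the
# mapped contributions matches the target (plain least squares).
X = contribs.reshape(num_images, num_components * d_model)
W, *_ = np.linalg.lstsq(X, clip_targets, rcond=None)
fitted_maps = W.reshape(num_components, d_model, d_clip)

# Aligned contributions live in the target space and still sum to the
# reconstructed representation.
aligned = np.einsum("ncd,cdk->nck", contribs, fitted_maps)
reconstruction = aligned.sum(axis=1)

# A text embedding (random here) scores each component separately;
# the per-component scores add up to the score of the full representation.
text_embedding = rng.normal(size=d_clip)
per_component_scores = aligned @ text_embedding
total_scores = reconstruction @ text_embedding
```

Because every operation is linear, the similarity between a text embedding and the full representation splits exactly into one term per component, which is what makes a per-component text interpretation possible in this toy setting.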

Paper Structure

This paper contains 23 sections, 1 theorem, 12 equations, 16 figures, 6 tables, 3 algorithms.

Key Result

Theorem 1

Both of the above conditions together imply that every linear map $f_i$ must be a scalar multiple of an orthogonal transformation; that is, for all $i$, $f_i^T f_i = k I$ for some constant $k$. Here, $I$ is the identity transformation.
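
A small numerical check can make the conclusion of Theorem 1 concrete. The sketch below assumes a square map for simplicity and uses hypothetical names (`f`, `g`, `k`); it only verifies the stated property $f^T f = k I$ and its direct consequence that inner products are rescaled by $k$, and it does not reproduce the conditions or the proof of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2.5

# Build a random orthogonal matrix Q via QR, then scale it so that
# f is a scalar multiple of an orthogonal transformation.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
f = np.sqrt(k) * Q

# The Gram matrix f^T f equals k times the identity.
gram = f.T @ f

# Consequence: inner products are preserved up to the factor k,
# since (f u) . (f v) = u^T f^T f v = k (u . v).
u, v = rng.normal(size=d), rng.normal(size=d)
mapped_inner = (f @ u) @ (f @ v)
scaled_inner = k * (u @ v)

# A generic linear map does not satisfy the property.
g = rng.normal(size=(d, d))
generic_is_scaled_orthogonal = np.allclose(
    g.T @ g, (np.trace(g.T @ g) / d) * np.eye(d)
)
```

Maps of this form rescale all inner products by the same constant, so the relative ordering of similarities between mapped vectors is unchanged.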

Figures (16)

  • Figure 1: (Left) Workflow: The first step (RepDecompose) decomposes a representation ${\bm{z}}$ into contributions ${\bm{c}}_i$ from its model components, after they have been transformed by residual transformations such as LayerNorm, linear projections, resampling, and patch merging. The second step (CompAlign) aligns each contribution to CLIP space using a set of linear maps $f_0, f_1, \dots, f_n$ applied to the corresponding contributions ${\bm{c}}_0, {\bm{c}}_1, \dots, {\bm{c}}_n$. We can then interpret these aligned contributions using the CLIP text encoder. (Right) Applications of our method: (a) visualizing the contribution of each token through a specific component using a joint token-component decomposition; (b) retrieving images that closely match the reference image (on top) with respect to a given image feature such as pattern, person, or location.
  • Figure 2: Ablation results for various image encoders. The top-1 ImageNet accuracy is plotted as the layers of the model are progressively ablated, starting from the last layer and proceeding to the first. The circles on the plot mark the endpoints of blocks, the definition of which varies across model architectures. For the vanilla ViT variants, a block is an attention-MLP pair, while for Swin, it is a pair of windowed/shifted-windowed attention and an MLP. For MaxViT, a block is either a grid/block attention-MLP pair or an MBConv block.
  • Figure 3: Top-3 images retrieved by DeiT components for "forest" and "beach", with the components ordered according to their relevance for the attribute "location". Each column corresponds to the images returned by the sum of contributions of 3 components, so column $i$ corresponds to components ${\bm{c}}_{3i}, {\bm{c}}_{3i + 1}, {\bm{c}}_{3i + 2}$. A large fraction of the components that can recognize the "location" feature are sorted correctly by the scoring function.
  • Figure 4: Top-3 images retrieved by the most significant components for various features relevant to the reference image (displayed on top). The models used are (from left to right) DINO, DeiT, and Swin. More exhaustive results can be found in the appendix.
  • Figure 5: Visualization of token contributions as heatmaps for two example images for the DeiT model. The relevant feature and the head most closely associated with it are displayed at the bottom of the heatmap, while the feature instantiation is displayed at the top. The layer numbering starts from the last layer (which has index '00'). The regions highlighted in red contribute positively to the prediction, while blue regions contribute negatively. More results are given in the appendix.
  • ...and 11 more figures
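
The grouping described in the Figure 3 caption, where column $i$ sums the contributions of components $3i$, $3i+1$, and $3i+2$ in ranked order, can be sketched as follows. The aligned contributions, the text embedding, and the component ranking are all random placeholders, the function name `retrieve_by_group` is hypothetical, and using a plain dot product as the similarity is an assumption; the paper's scoring function is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
num_images, num_components, d_clip = 50, 9, 6

# Hypothetical aligned contributions, shape (num_images, num_components, d_clip).
aligned = rng.normal(size=(num_images, num_components, d_clip))
# Placeholder text embedding for a query such as "forest".
text_embedding = rng.normal(size=d_clip)
# Placeholder ranking of components, most relevant first
# (a stand-in for the output of a scoring function).
ranking = rng.permutation(num_components)


def retrieve_by_group(aligned, ranking, text_embedding, group_size=3, top_k=3):
    """Return top_k image indices for each consecutive group of ranked components."""
    columns = []
    for start in range(0, len(ranking), group_size):
        group = ranking[start:start + group_size]
        # Sum the contributions of the components in this group.
        summed = aligned[:, group, :].sum(axis=1)
        # Dot-product similarity to the text embedding (an assumption).
        sims = summed @ text_embedding
        # Indices of the top_k most similar images, best first.
        columns.append(np.argsort(-sims)[:top_k])
    return np.stack(columns)


columns = retrieve_by_group(aligned, ranking, text_embedding)
```

Each row of `columns` holds the image indices for one column of such a figure, so the first row corresponds to the three highest-ranked components.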

Theorems & Definitions (2)

  • Theorem 1
  • proof