Interpreting CLIP's Image Representation via Text-Based Decomposition

Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt

TL;DR

This work presents a scalable framework to interpret CLIP-ViT by decomposing its image representations into layer-, head-, and token-level contributions, anchored by CLIP's text-space. It finds that the final four MSA layers drive most direct effects, and introduces TextSpan to label head- and direction-specific outputs with text, revealing property-specific heads and emergent spatial localization. The authors demonstrate practical benefits, including reducing spurious cues in Waterbirds and achieving state-of-the-art zero-shot semantic segmentation through image-token decomposition, as well as enabling property-based image retrieval. Overall, the paper provides a principled method to dissect transformer-based multimodal encoders and shows how such insights can repair and improve downstream performance.

Abstract

We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models.
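The additive decomposition described in the abstract can be illustrated with a toy example. The sketch below is not the authors' code: the sizes, the random `contrib` values, and the helper names `dot` and `vec_sum` are all hypothetical, and it covers only the terms summed over layers, heads, and patches. It shows that, because the representation is a sum, its similarity to a text vector splits exactly into per-head and per-patch scores.

```python
import random

random.seed(0)
L, H, N, D = 2, 3, 4, 5  # toy sizes (assumed): layers, heads, patches, embedding dim

# contrib[l][h][i] stands in for one summand of the image representation.
# Random placeholder values; a real model would supply these.
contrib = [[[[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
            for _ in range(H)] for _ in range(L)]
text = [random.gauss(0, 1) for _ in range(D)]  # placeholder text embedding


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def vec_sum(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(col) for col in zip(*vectors)]


# Full representation: the sum over every layer, head, and patch.
image_rep = vec_sum([contrib[l][h][i]
                     for l in range(L) for h in range(H) for i in range(N)])

# Per-head score: sum the patch contributions of one head, then compare to text.
head_score = [[sum(dot(contrib[l][h][i], text) for i in range(N))
               for h in range(H)] for l in range(L)]

# Per-patch score: sum over layers and heads, giving one value per image patch.
patch_score = [sum(dot(contrib[l][h][i], text)
                   for l in range(L) for h in range(H)) for i in range(N)]

total = dot(image_rep, text)
```

By linearity, `total` equals both the sum of all entries in `head_score` and the sum of `patch_score`, up to floating-point error. Arranged on the patch grid, `patch_score` is the kind of quantity that could be read as a heatmap over image regions.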

Paper Structure (19 sections, 11 equations, 13 figures, 13 tables, 1 algorithm)

Figures (13)

  • Figure 1: CLIP-ViT image representation decomposition. By decomposing CLIP's image representation as a sum across individual image patches, model layers, and attention heads, we can (a) characterize each head’s role by automatically finding text-interpretable directions that span its output space, (b) highlight the image regions that contribute to the similarity score between image and text, and (c) present what regions contribute towards a found text direction at a specific head.
  • Figure 2: MLPs mean-ablation. We simultaneously replace all the direct effects of the MLPs with their average taken across ImageNet's validation set. This results in only a small reduction in zero-shot classification performance.
  • Figure 3: MSAs accumulated mean-ablation. We replace all the direct effects of the MSAs up to a given layer with their average taken across the ImageNet validation set. Only the replacement of the last few layers causes a large decrease in accuracy.
  • Figure 4: ImageNet classification accuracy for the image representation projected to TextSpan bases. We evaluate our algorithm for different initial description pools, and with different output sizes.
  • Figure 5: Top-4 images for the top head description found by TextSpan. We retrieve images with the highest similarity score between $c^{l,h}_{\text{head}}$ and the top text representation found by TextSpan. They correspond to the provided text descriptions. See Figure \ref{fig:large_nns} in the appendix for randomly selected heads.
  • ...and 8 more figures
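
The mean-ablation mentioned in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `mean_ablate` and the nested-list layout `effects[n][k]` (direct effect of component `k` on image `n`) are hypothetical. Each ablated component is replaced by its average over the dataset, and the remaining components are left untouched.

```python
def mean_ablate(effects, ablate):
    """Rebuild each image representation with selected components mean-ablated.

    effects[n][k] is the direct-effect vector of component k for image n
    (hypothetical layout). Components whose index is in `ablate` are replaced
    by their average over all images; the others are kept as they are.
    """
    n_images = len(effects)
    n_components = len(effects[0])
    dim = len(effects[0][0])

    # Dataset mean of every component's direct effect.
    means = [[sum(effects[n][k][j] for n in range(n_images)) / n_images
              for j in range(dim)]
             for k in range(n_components)]

    representations = []
    for n in range(n_images):
        rep = [0.0] * dim
        for k in range(n_components):
            source = means[k] if k in ablate else effects[n][k]
            rep = [r + s for r, s in zip(rep, source)]
        representations.append(rep)
    return representations


# Two images, two components, two dimensions (toy values).
toy = [[[1.0, 0.0], [0.0, 2.0]],
       [[3.0, 0.0], [0.0, 4.0]]]
ablated_first = mean_ablate(toy, {0})
ablated_all = mean_ablate(toy, {0, 1})
```

An accumulated ablation up to a given layer, as in Figure 3, would correspond to passing `set(range(k))` for increasing `k`, assuming components are indexed by layer. When every component is ablated, all images collapse to the same mean representation, which is why accuracy would be expected to drop once the informative components are included.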