Quantifying and Enabling the Interpretability of CLIP-like Models

Avinash Madasu, Yossi Gandelsman, Vasudev Lal, Phillip Howard

TL;DR

This work tackles the interpretability of CLIP-like vision-language transformers by decomposing attention heads with the TextSpan algorithm and labeling head properties through in-context learning. It introduces two metrics, the entanglement score and the association score, to quantify how cleanly properties map to individual heads and how independently heads attend to those properties. Across six CLIP variants, the study finds that larger models tend to be more interpretable, exhibiting lower entanglement and higher property consistency. To put these insights into practice, the authors implement CLIP-InterpreT, a tool offering five analyses (including per-head segmentation and nearest-neighbor search) that help users understand the inner workings of CLIP-like models.

Abstract

CLIP is one of the most popular foundational models and is heavily used for many vision-language tasks. However, little is known about the inner workings of CLIP. To bridge this gap, we propose a study to quantify interpretability in CLIP-like models. We conduct this study on six different CLIP models from OpenAI and OpenCLIP, which vary by size, type of pre-training data, and patch size. Our approach begins by using the TextSpan algorithm and in-context learning to break down individual attention heads into specific properties. We then evaluate how easily these heads can be interpreted using new metrics that measure property consistency within heads and property disentanglement across heads. Our findings reveal that larger CLIP models are generally more interpretable than their smaller counterparts. To further assist users in understanding the inner workings of CLIP models, we introduce CLIP-InterpreT, a tool designed for interpretability analysis. CLIP-InterpreT offers five types of analyses: property-based nearest neighbor search, per-head topic segmentation, contrastive segmentation, per-head nearest neighbors of an image, and per-head nearest neighbors of text.
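
To make the head-decomposition step more concrete, here is a minimal sketch of a greedy text-basis selection in the spirit of TextSpan. It is not the authors' implementation. It assumes the per-image outputs of one attention head and a pool of candidate text embeddings already live in the same joint space, and the function name `greedy_text_basis` and its parameters are hypothetical.

```python
import numpy as np

def greedy_text_basis(head_outputs, text_embeds, num_directions):
    """Greedily pick text directions that explain the most variance of one head.

    head_outputs: (K, d) array, one head's contribution for K images (assumed input).
    text_embeds:  (M, d) array, candidate text embeddings in the same space (assumed input).
    Returns the indices of the selected text candidates, in selection order.
    """
    # Centre the head outputs so projections measure variance around the mean.
    C = head_outputs - head_outputs.mean(axis=0, keepdims=True)
    R = np.array(text_embeds, dtype=float)
    available = np.ones(len(R), dtype=bool)
    chosen = []
    for _ in range(num_directions):
        # Projection of every image's head output onto every text candidate.
        scores = R @ C.T                      # shape (M, K)
        variances = scores.var(axis=1)
        variances[~available] = -np.inf       # never pick the same text twice
        j = int(np.argmax(variances))
        norm = np.linalg.norm(R[j])
        if norm < 1e-12:                      # nothing left to explain
            break
        chosen.append(j)
        available[j] = False
        direction = R[j] / norm
        # Remove the chosen direction from both the head outputs and the texts,
        # so the next pick explains variance not already covered.
        C = C - np.outer(C @ direction, direction)
        R = R - np.outer(R @ direction, direction)
    return chosen
```

The texts selected for a head can then be passed to a language model for property labeling, which is the role in-context learning plays in the pipeline described above.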

Paper Structure (14 sections, 15 figures, 2 tables)

Figures (15)

  • Figure 1: Interface of the CLIP-InterpreT application. Users can upload an image and select a model to analyze. There are five tabs for different decomposition analyses.
  • Figure 2: Top-4 nearest neighbors for the "colors" property. The model used is ViT-B-32 (DataComp). The input image of a tiger is shown to the left of the dotted line, and the four retrieved images are shown to the right. In this example, the input and the retrieved images share orange, black, and green colors.
  • Figure 3: Topic segmentation results for Layer 11, Head 3 (an "environment/weather" head). The model used is ViT-B-16 (LAION-2B). In the first image (left), the heatmap (blue) is focused on "flowers", which matches the text description. In the second image (middle), the heatmap (blue) is concentrated on the "tornado", matching the text description. In the last image, the heatmap (blue) is focused on the "sun", matching the description "Hot Summer".
  • Figure 4: Contrastive segmentation between the portions of the image containing "tornado" and "thunderstorm." The model used is ViT-L-14 pretrained on the LAION-2B dataset.
  • Figure 5: Top-8 nearest neighbors per head and image. The input image is provided on the left, with the head-specific nearest neighbors shown on the right. The model used is OpenAI-400M.
  • ...and 10 more figures
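
The per-head nearest-neighbor analyses shown in the figures can be pictured as a cosine-similarity search restricted to one head's contribution. The sketch below is an illustration under that assumption, not the tool's actual code. The function `per_head_nearest_neighbors` and the array layout are hypothetical.

```python
import numpy as np

def per_head_nearest_neighbors(query_heads, gallery_heads, head, k=8):
    """Rank gallery images by similarity to the query within a single head.

    query_heads:   (H, d) array, per-head contributions for the query image (assumed layout).
    gallery_heads: (N, H, d) array, per-head contributions for N gallery images (assumed layout).
    Returns (indices, similarities) of the top-k gallery images for the given head.
    """
    q = query_heads[head]
    g = gallery_heads[:, head, :]
    # Normalise so the dot product equals cosine similarity.
    q = q / (np.linalg.norm(q) + 1e-12)
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-12)
    sims = g @ q
    order = np.argsort(-sims)[:k]
    return order.tolist(), sims[order].tolist()
```

Running the same query against different heads would return different neighbors, which is what makes a head's learned property visible.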