
A Closer Look at the Explainability of Contrastive Language-Image Pre-training

Yi Li, Hualiang Wang, Yiqun Duan, Jiheng Zhang, Xiaomeng Li

TL;DR

This work scrutinizes explainability gaps in CLIP, notably opposite foreground emphasis and noisy activations in CAM-based visualizations. It attributes these issues to inconsistent self-attention and redundant cross-modal features, and introduces CLIP Surgery, comprising Architecture Surgery (consistent self-attention with dual-paths) and Feature Surgery (removing redundant features) that operate at inference without fine-tuning. Across multiple backbones and datasets, the method delivers substantial gains in explainability metrics and enables open-vocabulary segmentation, multi-label recognition, interactive segmentation, and multimodal visualization, often surpassing state-of-the-art CAM approaches without extra training. The approach provides practical, high-quality CAM for CLIP and offers insights into the architecture and feature dynamics of multimodal transformers.

Abstract

Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks. However, we have identified some issues with its explainability, which undermine its credibility and limit its capacity for related tasks. Specifically, we find that CLIP tends to focus on background regions rather than foregrounds, with noisy activations at irrelevant positions in the visualization results. These phenomena conflict with conventional explainability methods based on the class attention map (CAM), where the raw model can highlight the local foreground regions using global supervision without alignment. To address these problems, we take a closer look at its architecture and features. Based on thorough analyses, we find that the raw self-attentions link to inconsistent semantic regions, resulting in the opposite visualization. In addition, the noisy activations stem from redundant features among categories. Building on these insights, we propose CLIP Surgery for reliable CAM, a method that allows surgery-like modifications to the inference architecture and features, without further fine-tuning, as in classical CAM methods. This approach significantly improves the explainability of CLIP, surpassing existing methods by large margins. It also enables multimodal visualization and extends the capacity of raw CLIP on open-vocabulary tasks without extra alignment. The code is available at https://github.com/xmed-lab/CLIP_Surgery.
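
To make the "redundant features among categories" idea concrete, here is a minimal sketch of one way such a removal step could look. It is an illustration under stated assumptions, not the authors' implementation: the function name `feature_surgery`, the use of element-wise products, and the plain (unweighted) mean over classes are all hypothetical choices, and the released code may differ, for example in how classes are weighted.

```python
def feature_surgery(image_tokens, text_embeds):
    """Return an N x K similarity map with class-shared contributions removed.

    image_tokens: N token features, each a list of C floats.
    text_embeds:  K text features, each a list of C floats.
    Assumption: the "redundant" part is the per-channel mean over all classes.
    """
    n_classes = len(text_embeds)
    similarity = []
    for token in image_tokens:
        # Element-wise products expose each channel's contribution to each class score.
        per_class = [[t * e for t, e in zip(token, emb)] for emb in text_embeds]
        # Contribution shared by all classes, channel by channel.
        shared = [sum(channel) / n_classes for channel in zip(*per_class)]
        # Subtract the shared part, then sum over channels to get one score per class.
        similarity.append(
            [sum(p - s for p, s in zip(row, shared)) for row in per_class]
        )
    return similarity


# Toy example: channel 0 contributes equally to both classes, channel 1 separates them.
tokens = [[1.0, 1.0]]
texts = [[1.0, 0.5], [1.0, -0.5]]
raw = [[sum(t * e for t, e in zip(tok, emb)) for emb in texts] for tok in tokens]
cleaned = feature_surgery(tokens, texts)
```

In this toy case the raw scores for both classes are inflated by the shared channel, while the cleaned scores keep only the part that distinguishes one class from the other.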

Paper Structure

This paper contains 14 sections, 12 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: The CAM of CLIP exhibits opposite and noisy visualizations, as indicated by the black boxes in the second and third rows, respectively. In contrast, our CLIP surgery addresses these issues, resulting in improved visualizations, as shown in the last row. The foregrounds are shown in red, while the backgrounds are indicated in blue.
  • Figure 2: Overview of the proposed CLIP Surgery, whose two parts are shown at the top. The architecture surgery is built on a dual-path structure, where the new blocks (bottom left) use consistent self-attention with homogeneous parameters and no feed-forward networks (FFN). The data flow of feature surgery, which mitigates redundant features across texts, is shown at the bottom right.
  • Figure 3: CLIP shows opposite visualizations with noisy activations, which are common across backbones (ResNet [He et al., 2016], ViT [Dosovitskiy et al., 2020]) and methods (Grad-CAM [Selvaraju et al., 2017], Bi-modal [Chefer et al., 2021]). Note that Bi-modal is designed for ViT and is not applicable to ResNet.
  • Figure 4: The raw self-attention attends more to backgrounds, while the proposed consistent self-attention corrects this. (a) Qualitative comparison between the raw self-attention and our consistent self-attention at the highest-scoring token on the last layer. The raw self-attention attends strongly to regions of the opposite semantics, while our consistent self-attention links nearby tokens with the same semantics. (b) Quantitative comparison of how strongly self-attention focuses on foregrounds. The y-axis gives the ratio of samples whose mFSR exceeds a threshold, where mFSR measures the degree of attention on the foregrounds, as defined in the paper's mFSR equation.
  • Figure 5: Analysis of intermediate self-attention blocks (green) and feed-forward networks (FFNs, yellow) at different depths via the cosine affinity defined in the paper. The blue line indicates the mean cosine over all positive labels on the VOC 2012 dataset [Everingham et al., 2010] using the ViT-B/16 backbone, and the red line is that of the negative labels. All scatters are the mean cosine of positive labels and are expected to lie close to the blue line. Larger scatters indicate that the features of this block are more inconsistent with the final prediction.
  • ...and 6 more figures
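
The captions for Figures 2 and 4 describe a consistent self-attention built on "homogeneous parameters". One possible reading is that queries and keys come from the same projection, so that tokens with similar features attend to each other. The sketch below assumes exactly that and reuses the value vectors for queries and keys; the helper names (`attention_weights`, `attend`, `consistent_self_attention`) are hypothetical, and the paper's blocks may differ in detail.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    # Scaled dot-product scores, normalized per query token.
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def attend(weights, values):
    # Weighted sum of value vectors for each query token.
    dim = len(values[0])
    return [
        [sum(w * v[c] for w, v in zip(row, values)) for c in range(dim)]
        for row in weights
    ]


def consistent_self_attention(values):
    # Assumption: the same vectors serve as queries, keys, and values,
    # so the score matrix is symmetric and driven by feature similarity.
    weights = attention_weights(values, values)
    return weights, attend(weights, values)


# Three toy tokens: the first two are similar, the third is different.
values = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, outputs = consistent_self_attention(values)
```

With these toy tokens, the first token places more weight on its similar neighbour than on the dissimilar third token, which is the kind of same-semantics linking the Figure 4 caption describes.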