Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP

Chenyang Zhao, Kun Wang, Janet H. Hsiao, Antoni B. Chan

TL;DR

Grad-ECLIP introduces a white-box, gradient-based approach to explain CLIP by producing heat maps that reflect how image regions and words influence image-text matching. By deriving channel weights from gradients and employing a loosened spatial weight to counteract sparse attention, Grad-ECLIP yields high-quality, result-specific visual and textual explanations applicable to ViT and CNN backbones, as well as other vision-language models. The authors demonstrate superior faithfulness, localization, and robustness across datasets and domains, and show how these explanations reveal CLIP’s concept decomposition, attribution tendencies, and tendency to rely on concrete words. Additionally, Grad-ECLIP enables a fine-grained CLIP fine-tuning framework that uses region-phrase mappings to strengthen region-wise alignment without sacrificing global performance. Overall, Grad-ECLIP provides a practical, generalizable tool for interpreting and improving vision-language models, with implications for debugging, prompt design, and downstream dense prediction tasks.

Abstract

Significant progress has been made in improving the Contrastive Language-Image Pre-training (CLIP) vision-language model and in applying it downstream, while less attention has been paid to interpreting CLIP. We propose a Gradient-based visual and textual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Unlike previous Transformer interpretation methods that rely on self-attention maps, which are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights to token features. Qualitative and quantitative evaluations verify the effectiveness and superiority of Grad-ECLIP compared with state-of-the-art methods. Furthermore, we conduct a series of analyses based on our visual and textual explanation results, exploring the working mechanism of image-text matching, the strengths and limitations of CLIP in attribution identification, and the relationship between the concreteness/abstractness of a word and its usage in CLIP. Finally, because the explanation map indicates the text-specific salient region of the input image, we also propose an application of Grad-ECLIP that boosts fine-grained alignment in CLIP fine-tuning. The code of Grad-ECLIP is available here: https://github.com/Cyang-Zhao/Grad-Eclip.
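
As a rough illustration of "applying channel and spatial weights on token features", the sketch below aggregates per-token value vectors into a heat map. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the channel weights (gradients of the matching similarity with respect to the attention-layer output) are assumed to be supplied by the caller instead of being computed by autodiff, and min-max normalization of the raw query-key scores is assumed as the "loosened" replacement for softmax attention.

```python
import math


def min_max(scores):
    """Rescale scores to [0, 1]; assumed stand-in for the loosened spatial weight."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def heatmap_sketch(values, channel_weights, q_cls, keys):
    """Aggregate token values into one non-negative importance score per token.

    values:          N x C list of value vectors (one per image token)
    channel_weights: length-C list, assumed to be the gradient of the
                     image-text similarity w.r.t. the attention-layer output
    q_cls:           length-C query vector of the class token
    keys:            N x C list of key vectors
    """
    dim = len(q_cls)
    # Raw scaled dot-product scores, normalized without softmax (assumption).
    raw = [sum(q * k for q, k in zip(q_cls, key)) / math.sqrt(dim) for key in keys]
    spatial_weights = min_max(raw)

    heat = []
    for lam_i, v_i in zip(spatial_weights, values):
        # Channel-weighted sum of the token's value vector.
        weighted = sum(w_c * v_c for w_c, v_c in zip(channel_weights, v_i))
        # Keep only positive evidence, as in Grad-CAM-style maps.
        heat.append(max(0.0, lam_i * weighted))
    return heat


# Toy inputs: three tokens with two channels each.
toy_values = [[1.0, 0.0], [2.0, 0.0], [3.0, 1.0]]
toy_keys = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
toy_heat = heatmap_sketch(toy_values, [1.0, -1.0], [1.0, 0.0], toy_keys)
```

In a real model the resulting per-token scores would be reshaped to the patch grid and upsampled to the image size; that step is omitted here.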

Paper Structure

This paper contains 35 sections, 15 equations, 19 figures, 9 tables.

Figures (19)

  • Figure 1: Visual and textual explanations of CLIP for the image with the text "A dog is playing with frisbee" using (a) CLIPSurgery [li2023clipsurgery]; (b) MaskCLIP [zhou2022extract]; (c) Grad-CAM [selvaraju2017grad]; (d) RISE [petsiuk2018rise]; (e) raw attention in the last layer; (f) Rollout [abnar2020quantifying]; (g) GAME [chefer2021generic]; (h) M2IB [wang2024visual]; and (i) our Grad-ECLIP. For (e) to (i), textual explanations of the sentence are shown, where the intensity of the green color represents word importance. The other methods (a-d) are not applicable to text.
  • Figure 1: Faithfulness evaluation of image explanation on the ImageNet validation set: AUC for Deletion and Insertion curves, based on Top-1 (@1) or Top-5 (@5) classification accuracy. Either the ground truth or the prediction is used as the text input to CLIP. The second-best result is underlined.
  • Figure 2: Illustration of Grad-ECLIP. A visual explanation specific to an image-text pair is generated by treating the values in the attention layer as a feature map and aggregating them, weighted by the spatial importance $\lambda_{i}$ and the channel importance $w_{c}$. Gradients are propagated to the attention layer output to produce $w_{c}$, and the loosened attention map is applied as $\lambda_{i}$.
  • Figure 2: Evaluation of image explanation faithfulness on the MS COCO image-text retrieval (Karpathy's split) validation dataset: AUC for Deletion and Insertion curves for performance on image retrieval (IR) and text retrieval (TR) tasks.
  • Figure 3: Comparison of heat maps from: (a) the raw self-attention map in the last ViT layer; (b) Rollout [abnar2020quantifying]; (c) Grad-CAM [selvaraju2017grad]; (d) GAME [chefer2021generic]; (e) MaskCLIP [zhou2022extract]; (f) CLIPSurgery [li2023clipsurgery]; (g) M2IB [wang2024visual]; (h) RISE [petsiuk2018rise]; (i) our proposed Grad-ECLIP. Visual explanations are provided for the matching score between the image and specific text prompts, which can be nouns (e.g., car, dog) or verbs (e.g., holding, standing). Across the visualizations, Grad-ECLIP explains the different types of text prompts better than the other methods.
  • ...and 14 more figures
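
The Deletion and Insertion AUC scores named in the faithfulness captions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: `perturbation_curve`, `curve_auc`, and `toy_score` are hypothetical names, the image is flattened to a list of pixel values, removed pixels are set to a constant baseline, and `toy_score` stands in for the model's performance measure (classification accuracy or retrieval performance in the captions).

```python
import math


def perturbation_curve(pixels, saliency, score_fn, mode="deletion", steps=4, baseline=0.0):
    """Score the model while removing (or restoring) pixels in saliency order."""
    # Most salient pixels first.
    order = sorted(range(len(pixels)), key=lambda i: saliency[i], reverse=True)
    if mode == "deletion":
        current = list(pixels)              # start from the full image
    elif mode == "insertion":
        current = [baseline] * len(pixels)  # start from an empty image
    else:
        raise ValueError("mode must be 'deletion' or 'insertion'")

    scores = [score_fn(current)]
    chunk = max(1, math.ceil(len(pixels) / steps))
    for start in range(0, len(order), chunk):
        for i in order[start:start + chunk]:
            current[i] = baseline if mode == "deletion" else pixels[i]
        scores.append(score_fn(current))
    return scores


def curve_auc(scores):
    """Trapezoidal area under the curve with the x-axis normalized to [0, 1]."""
    intervals = len(scores) - 1
    if intervals < 1:
        raise ValueError("need at least two points on the curve")
    return sum((scores[k] + scores[k + 1]) / 2.0 for k in range(intervals)) / intervals


def toy_score(x):
    """Hypothetical model score that depends only on pixels 0 and 2."""
    return (2.0 * x[0] + x[2]) / 3.0


toy_pixels = [1.0, 1.0, 1.0, 1.0]
toy_saliency = [0.9, 0.1, 0.5, 0.3]  # ranks pixels 0 and 2 highest
deletion_auc = curve_auc(perturbation_curve(toy_pixels, toy_saliency, toy_score, "deletion"))
insertion_auc = curve_auc(perturbation_curve(toy_pixels, toy_saliency, toy_score, "insertion"))
```

A faithful explanation ranks the pixels the model actually relies on first, so the score drops quickly under deletion (low AUC) and recovers quickly under insertion (high AUC).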