
Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution

Ying Wang, Tim G. J. Rudner, Andrew Gordon Wilson

TL;DR

M2IB, a multi-modal information bottleneck approach, is applied to attribution analysis of vision-language pretrained models such as CLIP, increasing attribution accuracy and improving the interpretability of these models in safety-critical domains such as healthcare.

Abstract

Vision-language pretrained models have seen remarkable success, but their application to safety-critical settings is limited by their lack of interpretability. To improve the interpretability of vision-language models such as CLIP, we propose a multi-modal information bottleneck (M2IB) approach that learns latent representations that compress irrelevant information while preserving relevant visual and textual features. We demonstrate how M2IB can be applied to attribution analysis of vision-language pretrained models, increasing attribution accuracy and improving the interpretability of such models when applied to safety-critical domains such as healthcare. Crucially, unlike commonly used unimodal attribution methods, M2IB does not require ground truth labels, making it possible to audit representations of vision-language pretrained models when multiple modalities but no ground-truth data is available. Using CLIP as an example, we demonstrate the effectiveness of M2IB attribution and show that it outperforms gradient-based, perturbation-based, and attention-based attribution methods both qualitatively and quantitatively.
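
For orientation, the classical information bottleneck objective that this family of methods builds on can be written as below. This is the generic textbook form, not the paper's multi-modal objective, which does not appear in this excerpt. The symbols are assumptions for illustration: $X$ is the input, $Y$ is the quantity the representation should stay informative about, $Z$ is the learned latent representation, and $\beta > 0$ trades off the two terms.

$$
\max_{Z} \; I(Z; Y) \;-\; \beta \, I(Z; X)
$$

The first term rewards preserving relevant information (fitting) and the second penalizes retaining information about the input (compression). In a label-free image-text setting, one natural reading, assumed here rather than taken from the excerpt, is that the other modality plays the role of $Y$.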

Paper Structure (20 sections, 16 equations, 9 figures, 2 tables)


Figures (9)

  • Figure 1: Example Attribution Maps For Image and Text Inputs. The red rectangles in the second and third rows show the ground-truth bounding boxes associated with the text, provided in the MS-CXR dataset [MSCXR]. Multi-modal information bottleneck (M2IB) attribution maps successfully identify relevant objects in the given image-text pairs, while other methods provide less precise localization and neglect critical features in the inputs.
  • Figure 2: Example Saliency Maps when Involving Multiple Objects. Our method can successfully detect all occurrences of all relevant objects in image and text.
  • Figure 3: Visualization of Degradation. The third column is the element-wise product of the original image and the saliency map, with text tokens whose attribution scores fall below the 50th percentile masked by a blank token <B>; this input is used for the Increase in Confidence and Drop in Confidence metrics. The fourth column shows an example of the training data in ROAR+: image pixels with attribution scores above the 75th percentile are replaced by the channel mean, and text tokens with attribution scores above the 50th percentile are replaced by a blank token <B>. The results in \ref{tab:results} use the padding token as the blank token <B>.
  • Figure 4: Saliency Maps for Sanity Checks. "Finetuned" denotes the model finetuned on MIMIC-CXR [mimic-cxr], a chest X-ray dataset. "Pretrained" denotes pretrained CLIP [CLIP] from OpenAI. "Projection" denotes CLIP with a randomized projection layer, the last layer of the image or text encoder, which projects image or text features into the shared embedding space. "Random" means that all parameters in the model are randomly initialized. The remaining columns show models with weights randomized from the last layer down to the given layer. The results suggest that the saliency maps of M2IB attribution are sensitive to model weights, as desired, meaning that M2IB passes the sanity check.
  • Figure 5: Visualization of the Impact of Different Hyperparameters. Values of $\beta$ and $\sigma^2$ that bring the fitting and compression terms to a similar scale, together with a deeper layer $\ell$, usually give better performance.
  • ...and 4 more figures
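
The two degradation procedures described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `confidence_inputs` and `roar_inputs` are hypothetical, the image is assumed to be an (H, W, C) float array, the saliency map an (H, W) array, and percentiles are computed with NumPy's default linear interpolation.

```python
import numpy as np

BLANK = "<B>"  # blank token from the caption (the paper uses the padding token)


def confidence_inputs(image, saliency, tokens, token_scores, text_pct=50.0):
    """Keep salient evidence: weight pixels by saliency, blank low-score tokens."""
    # Element-wise product; the (H, W) map is broadcast over the channel axis.
    masked_image = image * saliency[..., None]
    cutoff = np.percentile(token_scores, text_pct)
    # Tokens scoring below the percentile cutoff are replaced by the blank token.
    masked_tokens = [t if s >= cutoff else BLANK
                     for t, s in zip(tokens, token_scores)]
    return masked_image, masked_tokens


def roar_inputs(image, saliency, tokens, token_scores,
                image_pct=75.0, text_pct=50.0):
    """Remove salient evidence: top pixels -> channel mean, top tokens -> blank."""
    degraded = image.astype(float)             # astype returns a copy
    channel_mean = degraded.mean(axis=(0, 1))  # one mean per channel, shape (C,)
    high = saliency > np.percentile(saliency, image_pct)  # (H, W) boolean mask
    degraded[high] = channel_mean              # overwrite the most salient pixels
    cutoff = np.percentile(token_scores, text_pct)
    degraded_tokens = [BLANK if s > cutoff else t
                       for t, s in zip(tokens, token_scores)]
    return degraded, degraded_tokens
```

The first function keeps only what the attribution map marks as important, which is the input the confidence metrics score. The second removes the important content instead, matching the caption's description of the ROAR+ training data.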