What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation

Michal Golovanevsky, William Rudman, Vedant Palit, Ritambhara Singh, Carsten Eickhoff

TL;DR

NOTICE addresses the opacity of vision-language decision-making by combining Semantic Image Pairs (SIP) for image corruption and Symmetric Token Replacement (STR) for text corruption with causal mediation analysis via activation patching. It shows that universal attention heads exist across BLIP and LLaVA: cross-attention heads perform object detection, object suppression, and outlier suppression, while self-attention heads mainly perform outlier suppression. The work provides a mechanistic account of multimodal integration and points toward more transparent and adaptable vision-language systems. It also explores generative SIP as a way to extend the corruption pipeline, highlighting both the robustness and the limitations of the approach across datasets and architectures.

Abstract

Vision-Language Models (VLMs) have gained widespread prominence due to their ability to integrate visual and textual inputs to perform complex tasks. Despite their success, the internal decision-making processes of these models remain opaque, posing challenges in high-stakes applications. To address this, we introduce NOTICE, the first Noise-free Text-Image Corruption and Evaluation pipeline for mechanistic interpretability in VLMs. NOTICE incorporates a Semantic Image Pairs (SIP) framework for image corruption and Symmetric Token Replacement (STR) for text. This approach enables semantically meaningful causal mediation analysis for both modalities, providing a robust method for analyzing multimodal integration within models like BLIP. Our experiments on the SVO-Probes, MIT-States, and Facial Expression Recognition datasets reveal crucial insights into VLM decision-making, identifying the significant role of middle-layer cross-attention heads. Further, we uncover a set of "universal cross-attention heads" that consistently contribute across tasks and modalities, each performing distinct functions such as implicit image segmentation, object inhibition, and outlier inhibition. This work paves the way for more transparent and interpretable multimodal systems.
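To make the text-corruption side of the pipeline concrete, the sketch below illustrates the idea behind Symmetric Token Replacement: swap the task-relevant token for a same-category substitute while leaving the rest of the prompt intact. This is a minimal sketch, not the authors' implementation; the `SYMMETRIC_SUBSTITUTES` table and the `symmetric_token_replacement` helper are illustrative assumptions, and the paper's actual replacement vocabulary is not reproduced here.

```python
# Minimal sketch of Symmetric Token Replacement (STR), assuming corruption
# means swapping the answer-relevant token for a same-part-of-speech
# substitute. The substitution table is hypothetical, for illustration only.

SYMMETRIC_SUBSTITUTES = {
    "puppy": "goat",       # noun -> noun
    "running": "sleeping", # verb -> verb
    "happy": "angry",      # adjective -> adjective
}

def symmetric_token_replacement(prompt: str, target: str) -> str:
    """Return a corrupted prompt with `target` swapped for its counterpart."""
    substitute = SYMMETRIC_SUBSTITUTES.get(target)
    if substitute is None:
        raise KeyError(f"No symmetric substitute registered for {target!r}")
    return prompt.replace(target, substitute)

clean = "Is the puppy running in the grass?"
corrupt = symmetric_token_replacement(clean, "puppy")
# clean:   "Is the puppy running in the grass?"
# corrupt: "Is the goat running in the grass?"
```

Because the substitute shares the original token's syntactic role, the corrupted prompt stays grammatical and semantically plausible, unlike Gaussian-noise corruption of embeddings.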

Paper Structure

This paper contains 20 sections, 20 figures, and 3 tables.

Figures (20)

  • Figure 1: NOTICE applied to SVO-Probes, MIT-States, and Facial Expression Recognition. NOTICE involves creating Semantic Image Pairs for image corruption and Symmetric Token Replacement for text corruption.
  • Figure 2: Activation patching using SIP corruption. The image of the puppy is the clean image $I$, and the goat is the corrupt image $I^{*}$. Patching the correct-answer token "puppy" from the clean run at $M_{l}(I, T)$ into the "puppy" token of the corrupt run $M(I^{*}, T)$ creates the patched states $M'(I^{*}, T)$, shown as orange diamonds (a code sketch of this operation follows the figure list).
  • Figure 3: Module-wise activation patching results for BLIP and LLaVA on "objects" from SVO-Probes. We visualize the restoration probability after patching for MLP, self-attention, and cross-attention layers in the image-grounded text-encoder. The y-axis denotes which token we patch, and the x-axis denotes which layer we patch.
  • Figure 4: Module-wise activation patching results for SIP and Gaussian-noise corruption on SVO-Probes with BLIP. SIP corruption produces activation patterns that align with Stable Diffusion results and highlight the importance of middle layers, while Gaussian noise fails to reveal meaningful attention layers, emphasizing the effectiveness of SIP for probing vision-language models.
  • Figure 5: Logit difference demonstrating the impact of patching the correct answer for each LLaVA self-attention head and BLIP cross-attention head on the SVO-Probes, MIT-States, and Facial Expression Recognition datasets. Many key attention heads overlap in importance across both modalities.
  • ...and 15 more figures
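For readers who want the mechanics behind Figures 2-5, the sketch below shows a generic activation-patching loop in PyTorch: cache one hidden state from the clean run, splice it into the corrupt run at the same (layer, token) position, and score the result. The `model`, `layer`, and answer-token ids are placeholders, and the metric definitions (e.g., any normalization of restoration probability) may differ from the paper's; this is an assumed, generic implementation of the technique, not the authors' code.

```python
import torch

@torch.no_grad()
def patch_activation(model, layer, token_pos, clean_inputs, corrupt_inputs):
    """Cache the hidden state at `token_pos` from the clean run M(I, T), then
    splice it into the corrupt run M(I*, T) at `layer` to get patched logits."""
    cache = {}

    def save_hook(module, args, output):
        # Many transformer blocks return tuples; hidden states are element 0.
        hidden = output[0] if isinstance(output, tuple) else output
        cache["clean"] = hidden[:, token_pos].clone()

    def patch_hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, token_pos] = cache["clean"]  # overwrite with the clean state
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handle = layer.register_forward_hook(save_hook)
    model(**clean_inputs)                              # clean run: cache state
    handle.remove()

    handle = layer.register_forward_hook(patch_hook)
    logits = model(**corrupt_inputs).logits            # patched corrupt run
    handle.remove()
    return logits

def restoration_probability(logits, answer_id):
    """Probability the patched model assigns to the correct answer (cf. Fig. 3)."""
    return logits[:, -1].softmax(dim=-1)[:, answer_id].mean().item()

def logit_difference(logits, answer_id, foil_id):
    """Correct-minus-incorrect answer logit after patching (cf. Fig. 5)."""
    last = logits[:, -1]
    return (last[:, answer_id] - last[:, foil_id]).mean().item()
```

Sweeping `layer` over a model's MLP, self-attention, and cross-attention modules and `token_pos` over the prompt tokens yields the layer-by-token heatmaps shown in Figures 3 and 4; restricting the patch to individual attention heads yields the head-level logit differences of Figure 5.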