Attention, Please! PixelSHAP Reveals What Vision-Language Models Actually Focus On

Roni Goldshmidt

TL;DR

PixelSHAP introduces object-level Shapley-based explanations for Vision-Language Models, enabling model-agnostic attribution without internal access. By perturbing segmented image objects and comparing responses via embedding similarity, it yields principled object importances suitable for auditing and debugging in high-stakes domains like autonomous driving. The work shows that the Bounding Box with Overlap Avoidance (BBOA) masking strategy offers strong attribution performance, and it is accompanied by an open-source implementation. Overall, the approach links object recognition with multimodal interpretability, delivering scalable, context-sensitive explanations for text-generative VLMs.

Abstract

Interpretability in Vision-Language Models (VLMs) is crucial for trust, debugging, and decision-making in high-stakes applications. We introduce PixelSHAP, a model-agnostic framework extending Shapley-based analysis to structured visual entities. Unlike previous methods focusing on text prompts, PixelSHAP applies to vision-based reasoning by systematically perturbing image objects and quantifying their influence on a VLM's response. PixelSHAP requires no model internals, operating solely on input-output pairs, making it compatible with open-source and commercial models. It supports diverse embedding-based similarity metrics and scales efficiently using optimization techniques inspired by Shapley-based methods. We validate PixelSHAP in autonomous driving, highlighting its ability to enhance interpretability. Key challenges include segmentation sensitivity and object occlusion. Our open-source implementation facilitates further research.
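
To make the attribution idea concrete, the following is a minimal sketch of exact Shapley values over a handful of image objects. It is not the paper's implementation: `toy_similarity` is a hypothetical stand-in for the real pipeline step (mask every object outside the coalition, query the VLM, and compare the response embedding with the original response), and its scores are invented for illustration. Exact enumeration is only feasible for a few objects, which is why the abstract mentions optimization techniques for scaling.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_similarity(visible):
    """Hypothetical similarity between the response to an image showing only
    `visible` objects and the response to the full image (assumed numbers)."""
    score = 0.2  # assumed: a background-only image still yields some similarity
    if "car" in visible:
        score += 0.5
    if "pedestrian" in visible:
        score += 0.2
    if "car" in visible and "pedestrian" in visible:
        score += 0.1  # assumed interaction between the two objects
    return score


objects = ["car", "pedestrian", "tree"]
phi = shapley_values(objects, toy_similarity)
for name in objects:
    print(f"{name}: {phi[name]:.3f}")
```

In this toy setup the irrelevant object receives zero importance, and the importances sum to the gap between the full-image score and the all-masked score, which is the efficiency property that makes Shapley values attractive for attribution.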

Paper Structure

This paper contains 29 sections, 6 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: PixelSHAP simulation of a car crash. The left shows the scene, prompt, and GPT-4o response, and the right highlights object importance.
  • Figure 2: Overview of the PixelSHAP framework. The method systematically perturbs object groups, queries a vision-language model (VLM), and computes Shapley values to quantify object importance.
  • Figure 3: Sample images from our dataset, each with a question and a bounding box highlighting the object needed to answer it.
  • Figure 4: Comparison of three masking strategies: (a) Precise Masking follows object contours exactly but may leave recognizable silhouettes; (b) Bounding Box Masking completely occludes the object but may inadvertently mask neighboring objects; (c) Bounding Box with Overlap Avoidance (BBOA) masks the object while preserving the contours of adjacent objects.
  • Figure 5: Average performance of different masking strategies across all VLMs. BBOA consistently outperforms other methods on most metrics, particularly in object localization (IoU) and relevance identification (Recall).
  • ...and 1 more figure
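
The three masking strategies in the Figure 4 caption can be sketched on pixel sets as follows. This is an illustrative reading of the caption only, not the authors' code: objects are represented as sets of `(row, col)` coordinates, the function names are hypothetical, and BBOA is assumed here to mean "the target's bounding box minus pixels that belong to other objects".

```python
def bounding_box(pixels):
    """Return (min_row, min_col, max_row, max_col) of a non-empty pixel set."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return min(rows), min(cols), max(rows), max(cols)


def precise_mask(target, others):
    """Mask exactly the target's pixels (the silhouette stays recognizable)."""
    return set(target)


def bbox_mask(target, others):
    """Mask the whole bounding box, including any neighbors inside it."""
    r0, c0, r1, c1 = bounding_box(target)
    return {(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)}


def bboa_mask(target, others):
    """Mask the bounding box but spare pixels owned by other objects
    (assumed interpretation of Bounding Box with Overlap Avoidance)."""
    occupied = set().union(*others)
    return bbox_mask(target, others) - (occupied - set(target))


# Toy scene: a diagonal target and a neighbor that intrudes into its box.
target = {(0, 0), (1, 1)}
neighbor = {(0, 1), (0, 2)}

precise = precise_mask(target, [neighbor])
box = bbox_mask(target, [neighbor])
bboa = bboa_mask(target, [neighbor])

print(sorted(precise))  # target pixels only
print(sorted(box))      # full 2x2 box, covers the neighbor pixel (0, 1)
print(sorted(bboa))     # box without the neighbor pixel
```

In the toy scene, the bounding-box mask hides part of the neighbor, while the BBOA variant still hides the background pixel that would otherwise reveal the target's outline but leaves the neighbor untouched.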