Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions

Hubert Baniecki, Maximilian Muschalik, Fabian Fumagalli, Barbara Hammer, Eyke Hüllermeier, Przemyslaw Biecek

TL;DR

This work tackles explainability of vision–language encoders by moving beyond first-order saliency to a faithful, interaction-based decomposition of image–text similarity using FIxLIP. It formulates a $2$-additive game and develops a cross-modal, $p$-weighted masking strategy with unbiased estimators that efficiently compute cross-modal and intra-modal interactions. Three metrics—$p$-faithfulness, area between insertion/deletion curves, and pointing game recognition—are extended to second-order explanations and validated on MS COCO and ImageNet-1k, showing FIxLIP outperforms baselines and scales to large models. The approach enables reliable model interpretation and cross-model comparisons, with practical impact for debugging and safety in high-stakes settings.

Abstract

Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model's similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how to naturally extend explanation evaluation metrics, such as the pointing game and area between the insertion/deletion curves, to second-order interaction explanations. Experiments on the MS COCO and ImageNet-1k benchmarks validate that second-order methods, such as FIxLIP, outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models, e.g. CLIP vs. SigLIP-2.

Paper Structure

This paper contains 35 sections, 4 theorems, 32 equations, 17 figures, 4 tables.

Key Result

Theorem 1

For $i \in N_\mathcal{I} \cup N_\mathcal{T}$ and $\mathbf{e}^{\textsc{FIxLIP}\text{-}p}$, the first-order attribution values are given by $\mathbf{e}_i + p \sum_{j\in N_\mathcal{I} \cup N_\mathcal{T}: j\neq i} \mathbf{e}_{\{i,j\}}$, which are the weighted Banzhaf values of $\hat{\nu}$.
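The conversion in Theorem 1 can be checked numerically on a small example. The sketch below assumes that $\hat{\nu}$ is the $2$-additive game built from main effects $\mathbf{e}_i$ and pairwise interactions $\mathbf{e}_{\{i,j\}}$, and it uses the standard definition of the weighted Banzhaf value, in which a coalition $S \subseteq N \setminus \{i\}$ receives weight $p^{|S|}(1-p)^{n-1-|S|}$. All function names and the toy numbers are illustrative and not taken from the paper.

```python
import itertools
import numpy as np

def first_order_from_second_order(main, inter, p):
    """Closed form from Theorem 1: e_i + p * sum_{j != i} e_{ij}.

    `inter` is assumed symmetric; its diagonal is ignored.
    """
    inter = np.asarray(inter, dtype=float)
    off_diag = inter - np.diag(np.diag(inter))
    return np.asarray(main, dtype=float) + p * off_diag.sum(axis=1)

def two_additive_game(main, inter, baseline=0.0):
    """Return nu_hat(S) = baseline + sum of main effects + sum of pairwise terms in S."""
    def value(coalition):
        members = sorted(coalition)
        v = baseline + sum(main[i] for i in members)
        v += sum(inter[i][j] for i, j in itertools.combinations(members, 2))
        return v
    return value

def weighted_banzhaf_bruteforce(value, n, p):
    """Weighted Banzhaf values by enumerating every coalition S of N without i."""
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in itertools.combinations(others, size):
                weight = p ** size * (1 - p) ** (n - 1 - size)
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Toy explanation with four players (illustrative numbers only).
main = np.array([0.5, -0.2, 0.1, 0.3])
inter = np.array([
    [0.0, 0.4, -0.1, 0.0],
    [0.4, 0.0, 0.2, -0.3],
    [-0.1, 0.2, 0.0, 0.05],
    [0.0, -0.3, 0.05, 0.0],
])
p = 0.3

closed = first_order_from_second_order(main, inter, p)
brute = weighted_banzhaf_bruteforce(two_additive_game(main, inter), len(main), p)
print(closed)
print(np.allclose(closed, brute))
```

The agreement follows because, for a $2$-additive game, the marginal contribution of player $i$ to $S$ equals $\mathbf{e}_i + \sum_{j \in S} \mathbf{e}_{\{i,j\}}$, and each other player $j$ belongs to $S$ with probability $p$.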

Figures (17)

  • Figure 1: Explaining similarity in vision-language encoders with weighted Banzhaf interactions. We propose a cross-modal sampling strategy to efficiently query the model for $m^2$ game values from $m$ coalitions, and $p$-weighted masking to circumvent querying the model on out-of-distribution inputs. A regression-based approximation with weighted least squares (WLS) of second-order attributions gives a faithful decomposition of the predicted similarity score. Explanations of cross-modal and intra-modal interactions can be visualized and analyzed to interpret CLIP's similarity prediction. Red values denote positive interactions between tokens that increase the similarity, while blue values denote interactions that decrease it.
  • Figure 2: Visual comparison between FIxLIP and baselines. First-order attribution methods, e.g. GAME and Grad-ECLIP, lack the tools to faithfully explain complex similarity predictions of vision-language encoders like CLIP. Notably, in this example, the text token ant is the most important for the similarity prediction. One of the differences from exCLIP is that we include intra-modal and main effects in the approximation, which are crucial for obtaining faithful interaction explanations.
  • Figure 3: Insertion/deletion curves for CLIP (ViT-B/32) on MS COCO. AID score (higher is better) for FIxLIP against alternative explanation methods, where a random baseline scores $0$. The y-axis is normalized between the model's prediction on the original input ($100\%$) and on the fully removed one ($0\%$). Negative values mean that the similarity prediction on a partially masked input is smaller than the prediction on the fully masked input, i.e. the model predicts that the image-text inputs are dissimilar. Methods such as Grad-ECLIP and exCLIP fail to recover nonlinear rankings of important tokens, while our method faithfully recovers the optimal subset explanation. Extended results for CLIP (ViT-B/16) and SigLIP-2 (ViT-B/32) are in Figures \ref{fig:insertion_deletion_clip16} & \ref{fig:insertion_deletion_siglip32}.
  • Figure 4: $\boldsymbol{p}$-faithfulness correlation for CLIP (ViT-B/32) on MS COCO. Correlation for different variants of FIxLIP against other explanation methods (left). Game-theoretical approaches can also be evaluated with the $R^2$ coefficient (right). Extended results for CLIP (ViT-B/16) are in Figure \ref{fig:appendix_faithfulness}.
  • Figure 5: Computation time vs. budget for the FIxLIP explanation of SigLIP-2 (ViT-B/32), including game evaluations (model inference).
  • ...and 12 more figures
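
The cross-modal sampling idea from the Figure 1 caption can be illustrated with a minimal sketch. It rests on one reading of the caption: because the similarity is an inner product of an image embedding and a text embedding, $m$ masked images and $m$ masked texts can each be encoded once and then paired in all $m^2$ combinations. The encoder below is a hypothetical stand-in (pooling random token features), not CLIP or SigLIP-2, and treating $p$-weighted masking as keeping each token independently with probability $p$ is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_masks(m, n_players, p, rng):
    """Assumed p-weighted masking: keep each token independently with probability p."""
    return rng.random((m, n_players)) < p

def toy_encoder(masks, token_features):
    """Hypothetical encoder: sum the features of kept tokens, then L2-normalize."""
    pooled = masks.astype(float) @ token_features
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norms, 1e-12)  # an all-masked input maps to zeros

# Illustrative sizes: 9 image patches, 5 text tokens, m = 8 coalitions per modality.
n_img, n_txt, dim, m, p = 9, 5, 16, 8, 0.7
img_feats = rng.normal(size=(n_img, dim))
txt_feats = rng.normal(size=(n_txt, dim))

img_masks = sample_masks(m, n_img, p, rng)  # m image coalitions
txt_masks = sample_masks(m, n_txt, p, rng)  # m text coalitions

# Each modality is encoded m times (2m encoder passes in total) ...
img_emb = toy_encoder(img_masks, img_feats)
txt_emb = toy_encoder(txt_masks, txt_feats)

# ... while one matrix product yields all m * m cross-modal game values.
game_values = img_emb @ txt_emb.T
print(game_values.shape)
```

Entry `game_values[a, b]` is the similarity of the `a`-th masked image with the `b`-th masked text, so these values could serve as regression targets for a second-order approximation such as the WLS fit mentioned in the caption.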

Theorems & Definitions (20)

  • Definition 1: Explanation
  • Definition 2: Masking
  • Definition 3: Game
  • Definition 4: $p$-faithfulness
  • Remark 1
  • Definition 5: FIxLIP-$p$
  • Remark 2
  • Theorem 1: First-order conversion
  • Definition 6: Model-agnostic estimator
  • Definition 7: Cross-modal estimator
  • ...and 10 more