Fixing confirmation bias in feature attribution methods via semantic match

Giovanni Cinà, Daniel Fernandez-Llaneza, Ludovico Deponte, Nishant Mishra, Tabea E. Röber, Sandro Pezzelle, Iacer Calixto, Rob Goedhart, Ş. İlker Birbil

TL;DR

The authors argue that a structured approach is required to test whether hypotheses about a model are confirmed by its feature attributions, and that this approach is a first step towards resolving confirmation bias in XAI.

Abstract

Feature attribution methods have become a staple tool for disentangling the complex behavior of black box models. Despite their success, some scholars have argued that such methods suffer from a serious flaw: they do not allow a reliable interpretation in terms of human concepts. Simply put, visualizing an array of feature contributions is not enough for humans to conclude something about a model's internal representations, and confirmation bias can trick users into false beliefs about model behavior. We argue that a structured approach is required to test whether our hypotheses on the model are confirmed by the feature attributions. This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations. Building on the conceptual framework put forward in Cinà et al. [2023], we propose a structured approach to evaluate semantic match in practice. We showcase the procedure in a suite of experiments spanning tabular and image data, and show how the assessment of semantic match can give insight into both desirable model behaviors (e.g., focusing on an object relevant for prediction) and undesirable ones (e.g., focusing on a spurious correlation). We couple our experimental results with an analysis of the metrics used to measure semantic match, and argue that this approach constitutes the first step towards resolving the issue of confirmation bias in XAI.

Paper Structure (11 sections, 7 equations, 18 figures)

Figures (18)

  • Figure 1: Two histograms representing the distribution of explanations for a specific feature, for all points satisfying condition $A$. When assessing the validity of a hypothesis with condition $\bm{e} \models B$ iff $\bm{e}\in [0.3, 1]$, the hypothesis will have the same validity in both cases. However, the distribution on the left is less coherent with $\theta$ compared to the distribution on the right.
  • Figure 2: An example image from the MALeViC dataset (left), alongside the SHAP values generated by the model (center-left), the cyan bounding box around the object of interest (center-right) and the SHAP values after masking is applied (right).
  • Figure 3: For the data points such that $x_1<0$ and $x_3 = 0$, we check how similar the explanations are with respect to $\bm{e}_c$. We visualize this as a histogram with the distance on the horizontal axis. Refining the notion of distance to one that is hypothesis-driven, we remove the "noise" introduced by $x_2$ and ascertain that the explanations do cluster in the vicinity of $\bm{e}_c$.
  • Figure 4: Boxplots of AUC and median distance for hypotheses based on the contribution of the red target object in images from the MALeViC dataset. The boxplots are obtained by sampling all data points complying with the hypothesis as points of reference.
  • Figure 5: AUC and median distance for hypotheses $\theta_1$ (top) and $\theta_3$ (bottom) with different thresholds for the VOC2006 dataset.
  • ...and 13 more figures