Benchmarking Attribution Methods with Relative Feature Importance

Mengjiao Yang, Been Kim

TL;DR

The paper tackles the challenge of evaluating feature-attribution methods in the absence of ground-truth importance by introducing BAM, a framework with a semi-natural dataset and models trained to encode known relative feature importance. It provides three quantitative metrics (Model Contrast Score, Input Dependence Rate, and Input Independence Rate) to assess whether attribution methods correctly reflect relative feature importance across pairs of models, pairs of inputs, and functionally similar inputs. Empirical results show that some popular methods (e.g., Grad-CAM, TCAV) perform well on certain metrics while others (e.g., GB, IG variants) exhibit systematic false positives, and that rankings vary by metric. The work demonstrates a practical, scalable pre-check for attribution methods, open-sources its resources, and outlines a path for designing additional evaluation measures that better align explanations with model rationale and real-world usage.

Abstract

Interpretability is an important area of research for safe deployment of machine learning systems. One particular type of interpretability method attributes model decisions to input features. Despite active development, quantitative evaluation of feature attribution methods remains difficult due to the lack of ground truth: we do not know which input features are in fact important to a model. In this work, we propose a framework for Benchmarking Attribution Methods (BAM) with a priori knowledge of relative feature importance. BAM includes 1) a carefully crafted dataset and models trained with known relative feature importance and 2) three complementary metrics to quantitatively evaluate attribution methods by comparing feature attributions between pairs of models and pairs of inputs. Our evaluation on several widely used attribution methods suggests that certain methods are more likely to produce false positive explanations, that is, features that are incorrectly attributed as more important to model prediction. We open source our dataset, models, and metrics.
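
The idea of comparing feature attributions between a pair of models can be illustrated with a small sketch. It assumes that the attribution to a concept (for example, the object region) is the mean saliency inside a binary mask, averaged over images, and that the score is the difference of this quantity between two models. The names `concept_attribution` and `model_contrast_score` are hypothetical and are not taken from the released BAM code; the exact definitions in the paper may differ.

```python
import numpy as np


def concept_attribution(saliency, mask):
    """Mean saliency inside the masked region, averaged over images.

    saliency: array of shape (n_images, height, width)
    mask:     boolean array of the same shape marking the concept region
              (assumed non-empty for every image)
    """
    saliency = np.asarray(saliency, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    # Sum saliency inside each image's region, then divide by region size.
    per_image = (saliency * mask).sum(axis=(1, 2)) / mask.sum(axis=(1, 2))
    return float(per_image.mean())


def model_contrast_score(saliency_a, saliency_b, mask):
    """Difference in concept attribution between model A and model B.

    A positive value means the method attributes more importance to the
    concept for model A than for model B.
    """
    return concept_attribution(saliency_a, mask) - concept_attribution(saliency_b, mask)


# Toy example: two 4x4 images, with the "object" in the top-left 2x2 block.
mask = np.zeros((2, 4, 4), dtype=bool)
mask[:, :2, :2] = True

# Hypothetical saliency maps for an object model and a scene model.
sal_object_model = np.where(mask, 0.8, 0.1)
sal_scene_model = np.where(mask, 0.2, 0.7)

score = model_contrast_score(sal_object_model, sal_scene_model, mask)
print(f"contrast score: {score:.2f}")
```

In this toy setup the object model's saliency concentrates on the object region and the scene model's does not, so the score is positive, which is the direction one would expect from a method that reflects the known relative importance.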

Paper Structure

This paper contains 37 sections, 10 equations, 16 figures.

Figures (16)

  • Figure 1: BAM dataset examples and BAM models. The object neural network ($f_o$) is trained with object labels ($L_o$) and the scene neural network ($f_s$) is trained with scene labels ($L_s$).
  • Figure 2: [Top] Verifying relative feature importance between $f_o$ and $f_s$. Objects are more important to $f_o$ than they are to $f_s$. Scenes are more important to $f_s$ than to $f_o$. [Bottom] Test accuracy of bamboo forest with and without dog CF on models trained with $\{X_{o,s}^k\}$ for $k \in \{0.1, \dots, 1.0\}$.
  • Figure 3: An example of saliency map visualizations for $f_o$ and $f_s$. From qualitative examination alone, it is hard to rank method performance.
  • Figure 4: MCS between $f_o$ and $f_s$. Blue bars are measurements from the original BAM dataset. Red bars show robustness of this measure. Yellow bars are baselines. Numbers on top are standard deviations. Higher MCS is better.
  • Figure 5: An example of saliency map visualizations for models trained with CFs of different $k$. $k$ increases from left to right. A larger contrast within each row is better. (Full size figure in Appendix.)
  • ...and 11 more figures