Benchmarking Attribution Methods with Relative Feature Importance

Mengjiao Yang; Been Kim

TL;DR

The paper tackles the challenge of evaluating feature-attribution methods without ground-truth importance by introducing BAM, a framework with a semi-natural dataset and models trained to encode known relative feature importance. It provides three quantitative metrics (Model Contrast Score, Input Dependence Rate, and Input Independence Rate) to assess whether attribution methods correctly reflect relative importance between models, between inputs, and between functionally similar inputs. Empirical results show that some popular methods, such as Grad-CAM and TCAV, perform well on certain metrics, while others, such as Guided Backpropagation (GB) and Integrated Gradients (IG) variants, exhibit systematic false positives, and that rankings vary by metric. The work offers a practical, scalable pre-check for attribution methods, open-source resources, and a path for designing additional evaluation measures that better align explanations with model rationale and real-world usage.

Abstract

Interpretability is an important area of research for safe deployment of machine learning systems. One particular type of interpretability method attributes model decisions to input features. Despite active development, quantitative evaluation of feature attribution methods remains difficult due to the lack of ground truth: we do not know which input features are in fact important to a model. In this work, we propose a framework for Benchmarking Attribution Methods (BAM) with a priori knowledge of relative feature importance. BAM includes 1) a carefully crafted dataset and models trained with known relative feature importance and 2) three complementary metrics to quantitatively evaluate attribution methods by comparing feature attributions between pairs of models and pairs of inputs. Our evaluation on several widely-used attribution methods suggests that certain methods are more likely to produce false positive explanations, that is, features that are incorrectly attributed as more important to model prediction. We open source our dataset, models, and metrics.
