AttributionLab: Faithfulness of Feature Attribution Under Controllable Environments

Yang Zhang, Yawei Li, Hannah Brown, Mina Rezaei, Bernd Bischl, Philip Torr, Ashkan Khakzar, Kenji Kawaguchi

TL;DR

AttributionLab constructs a fully synthetic, controllable environment where both data and neural networks are designed to expose ground-truth feature attributions. It then uses a formal, model-agnostic faithfulness test to evaluate whether attribution maps align with the true features that drive the output, and it reveals that common perturbation-based evaluations can be unreliable under unseen data. The study evaluates several popular attribution methods (DeepSHAP, LIME, IG, GradCAM, IBA, ExPerturb, Occlusion) across signed and unsigned ground-truth scenarios, identifying when they succeed and where they fail. The results provide practical guidance for researchers on baselines, segmentation priors, and evaluation pitfalls, and offer a controlled stepping-stone toward more reliable explanations in real-world deployments.

Abstract

Feature attribution explains neural network outputs by identifying relevant input features. The attribution has to be faithful, meaning that the attributed features must mirror the input features that influence the output. One recent trend to test faithfulness is to fit a model on designed data with known relevant features and then compare attributions with ground truth input features. This idea assumes that the model learns to use all and only these designed features, for which there is no guarantee. In this paper, we solve this issue by designing the network and manually setting its weights, along with designing data. The setup, AttributionLab, serves as a sanity check for faithfulness: If an attribution method is not faithful in a controlled environment, it can be unreliable in the wild. The environment is also a laboratory for controlled experiments by which we can analyze attribution methods and suggest improvements.
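
To make the comparison between an attribution map and ground truth concrete, the following is a minimal sketch of one possible alignment score. It is an illustrative assumption, not the metric defined in the paper: the function name `topk_overlap`, the top-k rule, and the toy arrays are all hypothetical. It treats attribution values as unsigned importance and measures what fraction of the k highest-scoring pixels fall inside a binary ground-truth mask with k pixels.

```python
import numpy as np

def topk_overlap(attribution, gt_mask):
    """Fraction of the k highest-attributed pixels that lie in the ground-truth
    mask, where k is the number of ground-truth pixels (hypothetical score)."""
    flat_attr = np.asarray(attribution, dtype=float).ravel()
    flat_gt = np.asarray(gt_mask, dtype=bool).ravel()
    k = int(flat_gt.sum())
    if k == 0:
        raise ValueError("ground-truth mask is empty")
    # Indices of the k largest attribution values.
    top_idx = np.argsort(flat_attr)[::-1][:k]
    hits = int(flat_gt[top_idx].sum())
    return hits / k

# Toy ground truth: a 2x2 block of relevant pixels in a 4x4 image.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True

faithful = gt.astype(float)    # attribution concentrated on the ground truth
unfaithful = 1.0 - faithful    # attribution concentrated on the background

print(topk_overlap(faithful, gt), topk_overlap(unfaithful, gt))
```

A score near 1 means the most strongly attributed pixels coincide with the designed features; a score near 0 means the attribution points elsewhere. Signed ground truth would need a separate treatment of positive and negative attributions, which this sketch does not attempt.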

Paper Structure (70 sections, 2 theorems, 17 equations, 26 figures, 6 tables)

Key Result

Proposition 3.1

(Sensitivity property) The addition/removal of any ground-truth pixel to/from the background affects the output of the model.
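
The removal half of the sensitivity property can be checked mechanically on a toy model. The sketch below is an assumption-laden illustration, not the paper's network: `toy_counter`, `toy_left_half_only`, and `removal_changes_output` are hypothetical names, and the "model" is just a sum of ReLU-activated pixel values with hand-set unit weights. The second toy model deliberately ignores half of the image, so it violates the property.

```python
import numpy as np

def toy_counter(image):
    """Hand-set model: unit weight on every pixel, ReLU, then sum."""
    return float(np.maximum(image, 0.0).sum())

def toy_left_half_only(image):
    """Hand-set model that ignores the right half of the image."""
    half = image.shape[1] // 2
    return float(np.maximum(image[:, :half], 0.0).sum())

def removal_changes_output(model, image, gt_mask, background=0.0):
    """True if replacing each ground-truth pixel with the background value,
    one at a time, changes the model output."""
    base = model(image)
    for r, c in zip(*np.nonzero(gt_mask)):
        perturbed = image.copy()
        perturbed[r, c] = background
        if model(perturbed) == base:
            return False
    return True

# Toy input: zero background with two ground-truth pixels.
image = np.zeros((4, 4))
image[1, 0] = 1.0
image[2, 3] = 1.0
gt_mask = image > 0

print(removal_changes_output(toy_counter, image, gt_mask))
print(removal_changes_output(toy_left_half_only, image, gt_mask))
```

The first model reacts to every ground-truth pixel, while the second never sees the pixel at column 3, which mirrors the failure mode where a model reaches its output without using all designated features.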

Figures (26)

  • Figure 1: Designing data and model to set up a controllable environment for testing the faithfulness of attribution methods and analyzing their properties. To obtain the ground truth attribution, we explicitly design networks in tandem with inputs. The models follow conventional neural network designs and have sufficient complexity (shown in Table \ref{table:model_summary}). More synthetic environments, including different modules (e.g., a modulo computer) and different tasks (e.g., regression), are in Appendix \ref{sec:appendix:detail_model_design}. The faithfulness test performs a sanity check on attribution results in the synthetic setting with the ground truth attribution.
  • Figure 2: Faithfulness test in AttributionLab. An attribution ($3^{rd}$ to the last image) is faithful if it is aligned with the Ground Truth attribution ($2^{nd}$ image) for a given input ($1^{st}$ image). When attributions are not faithful in this controlled environment, how can they be reliable in the wild? AttributionLab can serve as a sanity check for faithfulness and a tool to analyze current and future attribution methods. In the following, we also analyze different factors that cause the misalignment. The exact setting for this figure is in Appendix \ref{sec:appendix:setting_for_visualization}.
  • Figure 3: Designing data is not enough. Example of a neural network not learning the designated ground truth features in the synthetic dataset. In this example, the designed ground truth features are objects both in the center and on the edge. Even though the model achieves $100\%$ accuracy, our test shows that it learns to use only the designed features at the corner and ignores the central ground truth features (more detail in Appendix \ref{sec:appendix:trained_nn_failure}).
  • Figure 4: Computational graph illustration of our designed neural network modules. The left example shows a neural network that identifies the number $5$, and the middle example shows a simple color detector for the RGB value $(255, 127, 0)$. In these two cases, blue boxes symbolize neurons, with their respective computations indicated within the box. ReLU activation is applied after each neuron; it is omitted in the figure. The right example demonstrates CNN operations that achieve accumulation using non-uniform kernel weights. More details can be found in Appendix \ref{sec:appendix:detail_model_design}.
  • Figure 5: Faithfulness test and visual examples of DeepSHAP. (a) Faithfulness test of DeepSHAP. (b) Visual examples in synthetic and real-world environments. According to (a) and (b), DeepSHAP correctly highlights the foreground pixels. However, it assigns both positive and negative attribution to these pixels, even when they have similar colors and close spatial locations.
  • ...and 21 more figures
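
As an illustration of how a color detector like the one in Figure 4 could be built from hand-set weights and ReLU units, here is a minimal sketch. The construction is an assumption and may differ from the paper's actual module: `channel_match` and `color_detector` are hypothetical names, and the sketch relies on pixel values being integers in $[0, 255]$. For each channel, three ReLU units form a unit-width "spike" that equals 1 only when the channel equals its target; a final ReLU with bias $-2$ fires only when all three channels match.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def channel_match(x, target):
    """For integer-valued x, returns 1 where x == target and 0 elsewhere,
    using only affine maps and ReLU (a triangular spike of half-width 1)."""
    d = x - target
    return relu(d + 1.0) - 2.0 * relu(d) + relu(d - 1.0)

def color_detector(image, target=(255, 127, 0)):
    """Per-pixel detector: 1 where the RGB triple equals `target`, else 0.
    `image` has shape (..., 3) with integer-valued entries."""
    image = np.asarray(image, dtype=float)
    per_channel = channel_match(image, np.asarray(target, dtype=float))
    # All three channels must match: their sum is 3 only on an exact match,
    # so subtracting 2 and applying ReLU leaves 1 there and 0 elsewhere.
    return relu(per_channel.sum(axis=-1) - 2.0)

# One row of three pixels: an exact match, a near miss, and black.
row = [[[255, 127, 0], [255, 128, 0], [0, 0, 0]]]
print(color_detector(row))
```

Because every weight is set by hand, the pixels that can influence the detector's output are known by construction, which is the property the designed environment relies on for ground truth.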

Theorems & Definitions (3)

  • Definition 2.1
  • Proposition 3.1
  • Proposition 3.2