AttributionLab: Faithfulness of Feature Attribution Under Controllable Environments
Yang Zhang, Yawei Li, Hannah Brown, Mina Rezaei, Bernd Bischl, Philip Torr, Ashkan Khakzar, Kenji Kawaguchi
TL;DR
AttributionLab constructs a fully synthetic, controllable environment in which both the data and the neural networks are designed so that ground-truth feature attributions are known. It then uses a formal, model-agnostic faithfulness test to evaluate whether attribution maps align with the true features that drive the output, and it reveals that common perturbation-based evaluations can be unreliable on unseen data. The study evaluates several popular attribution methods (DeepSHAP, LIME, IG, GradCAM, IBA, ExPerturb, Occlusion) across signed and unsigned ground-truth scenarios, identifying where they succeed and where they fail. The results provide practical guidance for researchers on baselines, segmentation priors, and evaluation pitfalls, and offer a controlled stepping-stone toward more reliable explanations in real-world deployments.
Abstract
Feature attribution explains neural network outputs by identifying relevant input features. The attribution has to be faithful, meaning that the attributed features must mirror the input features that influence the output. One recent trend to test faithfulness is to fit a model on designed data with known relevant features and then compare attributions with ground truth input features. This idea assumes that the model learns to use all and only these designed features, for which there is no guarantee. In this paper, we solve this issue by designing the network and manually setting its weights, along with designing data. The setup, AttributionLab, serves as a sanity check for faithfulness: If an attribution method is not faithful in a controlled environment, it can be unreliable in the wild. The environment is also a laboratory for controlled experiments by which we can analyze attribution methods and suggest improvements.
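As a rough illustration of the idea of pairing a hand-weighted network with known relevant features, the following minimal sketch builds a toy model and checks one attribution method against the ground truth. It is not the AttributionLab implementation: the five-feature input, the weights `W1` and `W2`, the zero baseline, and the `false_positive_mass` score are all hypothetical choices made only for this example.

```python
import numpy as np

# Hypothetical toy setup: 5 input features, of which only features 0 and 1
# are relevant by construction (this mask is the ground truth).
GROUND_TRUTH = np.array([True, True, False, False, False])

# Manually set weights (assumed for illustration): the single hidden unit
# computes ReLU(x0 + x1), and the output layer copies it unchanged.
W1 = np.array([[1.0, 1.0, 0.0, 0.0, 0.0]])  # shape (1, 5)
W2 = np.array([[1.0]])                      # shape (1, 1)


def model(x):
    """Two-layer network whose output depends only on features 0 and 1."""
    hidden = np.maximum(W1 @ x, 0.0)
    return float((W2 @ hidden)[0])


def occlusion_attribution(f, x, baseline=0.0):
    """Attribute to each feature the output drop when it is set to the baseline."""
    full = f(x)
    attr = np.zeros_like(x)
    for i in range(x.size):
        occluded = x.copy()
        occluded[i] = baseline
        attr[i] = full - f(occluded)
    return attr


def false_positive_mass(attr, gt):
    """Fraction of absolute attribution placed on features outside the ground truth."""
    total = np.abs(attr).sum()
    return 0.0 if total == 0 else float(np.abs(attr[~gt]).sum() / total)


x = np.array([2.0, 3.0, 7.0, -4.0, 1.0])
attr = occlusion_attribution(model, x)
print(attr, false_positive_mass(attr, GROUND_TRUTH))
```

Because the weights are fixed by hand rather than learned, any attribution mass on features 2 to 4 can be read as an error of the attribution method rather than as evidence that the model relies on unintended features.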
