Rethinking Robustness: A New Approach to Evaluating Feature Attribution Methods
Panagiota Kiourti, Anu Singh, Preeti Duraipandian, Weichao Zhou, Wenchao Li
TL;DR
The paper addresses a bias in how attribution robustness is evaluated: existing evaluations conflate the quality of an explanation with the model's own adversarial robustness. It introduces Output Similarity-based Robustness (OSR), a metric computed over a GAN-generated distribution of inputs whose logits stay within a specified delta of the original input's logits. This lets OSR assess explanation stability without treating the size of the input perturbation as the measure of similarity. Experiments on MNIST, CIFAR-10, and Chest X-ray data show that OSR differentiates attribution methods more objectively and reveals biases in existing metrics such as Robustness-Sr and fidelity. The work provides a practical framework for evaluating explanations and motivates further development of objective, decoupled robustness metrics for reliable XAI deployment.
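To make the output-similarity idea concrete, here is a minimal sketch of an OSR-style score, assuming a toy linear classifier, Gradient x Input as the attribution method, top-k intersection as the attribution-similarity measure, an L-infinity distance on logits, and a simple average over the accepted inputs. All of these choices, and the names `osr_sketch`, `grad_x_input`, and `topk_intersection`, are hypothetical illustrations, not the paper's definitions. The candidate inputs can come from any source; the paper uses a GAN, while the usage below substitutes plain random sampling filtered by the logit constraint.

```python
import numpy as np

def logits(W, x):
    """Toy linear classifier (hypothetical stand-in model): one logit per class."""
    return W @ x

def grad_x_input(W, x):
    """Gradient x Input attribution for the predicted class of a linear model.
    For logits = W @ x, d(logit_c)/dx = W[c], so the attribution is W[c] * x."""
    c = int(np.argmax(logits(W, x)))
    return W[c] * x

def topk_intersection(a, b, k):
    """Fraction of the k largest-magnitude features shared by two attribution maps."""
    top_a = set(np.argsort(-np.abs(a))[:k].tolist())
    top_b = set(np.argsort(-np.abs(b))[:k].tolist())
    return len(top_a & top_b) / k

def osr_sketch(W, x, candidates, delta, k):
    """Hypothetical OSR-style score: mean attribution similarity over candidates
    whose logits stay within `delta` (L-inf distance) of the original logits.
    Returns (score, number of accepted candidates); score is None if none pass."""
    ref_logits = logits(W, x)
    ref_attr = grad_x_input(W, x)
    scores = []
    for x_prime in candidates:
        # Output-similarity constraint: keep only inputs with nearly identical logits.
        if np.max(np.abs(logits(W, x_prime) - ref_logits)) <= delta:
            scores.append(topk_intersection(ref_attr, grad_x_input(W, x_prime), k))
    return (float(np.mean(scores)) if scores else None), len(scores)

# Usage: random candidates stand in for GAN samples (an assumption of this sketch).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 10))
x = rng.normal(size=10)
candidates = [x + 0.02 * rng.normal(size=10) for _ in range(200)]
score, n_kept = osr_sketch(W, x, candidates, delta=0.1, k=3)
print(score, n_kept)
```

The key design point this illustrates is that inputs are admitted by how close the model's outputs are, not by how close the inputs are, so a low score reflects instability in the attribution method rather than a large change in the model's prediction.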
Abstract
This paper studies the robustness of feature attribution methods for deep neural networks. It challenges the current notion of attributional robustness, which largely ignores differences in the model's outputs, and introduces a new way of evaluating the robustness of attribution methods. Specifically, we propose a new definition of similar inputs, a new robustness metric, and a novel method based on generative adversarial networks to generate these inputs. In addition, we present a comprehensive evaluation with existing metrics and state-of-the-art attribution methods. Our findings highlight the need for a more objective metric that reveals the weaknesses of an attribution method rather than those of the neural network, thus providing a more accurate evaluation of the robustness of attribution methods.
