Rethinking Robustness: A New Approach to Evaluating Feature Attribution Methods

Panagiota Kiourti, Anu Singh, Preeti Duraipandian, Weichao Zhou, Wenchao Li

TL;DR

The paper addresses the biased evaluation of attribution robustness that conflates explanation quality with a model's adversarial robustness. It introduces Output Similarity-based Robustness (OSR), a metric that relies on a GAN-generated distribution of inputs whose logits remain within a specified delta of the original, to assess explanation stability independent of input perturbations. Through MNIST, CIFAR-10, and Chest X-ray experiments, OSR is shown to more objectively differentiate attribution methods and reveal biases in existing metrics like Robustness-Sr and fidelity. The work provides a practical framework for evaluating explanations and motivates further development of objective, decoupled robustness metrics for reliable XAI deployment.
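The output-similarity criterion described above can be illustrated with a minimal sketch. It assumes that "similar" means a candidate keeps the original predicted class and that its logit for that class differs from the original's by at most $\delta$. The function name `similar_by_output` and the same-class requirement are illustrative assumptions, not the authors' implementation, and the GAN that produces the candidates is not shown.

```python
import numpy as np

def similar_by_output(logits_orig, logits_cand, delta=5.0):
    """Boolean mask over candidates whose output stays close to the original.

    Hypothetical helper: the same-class check and the use of the
    predicted-class logit are assumptions made for illustration.
    """
    logits_orig = np.asarray(logits_orig, dtype=float)   # shape (num_classes,)
    logits_cand = np.asarray(logits_cand, dtype=float)   # shape (n, num_classes)

    c = int(np.argmax(logits_orig))                      # original predicted class
    same_class = np.argmax(logits_cand, axis=1) == c     # prediction unchanged
    close = np.abs(logits_cand[:, c] - logits_orig[c]) <= delta
    return same_class & close

# Toy logits for one original input and three candidate inputs.
orig = [1.0, 9.0, 0.5]
cands = [
    [0.8, 7.5, 0.1],   # same class, logit gap 1.5
    [1.0, 2.0, 0.0],   # same class, logit gap 7.0
    [10.0, 9.0, 0.0],  # different predicted class
]
mask = similar_by_output(orig, cands, delta=5.0)
```

Only candidates selected by the mask would then be passed to an attribution method, so that any change in the explanation is compared across inputs the model treats alike.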

Abstract

This paper studies the robustness of feature attribution methods for deep neural networks. It challenges the current notion of attributional robustness that largely ignores the difference in the model's outputs and introduces a new way of evaluating the robustness of attribution methods. Specifically, we propose a new definition of similar inputs, a new robustness metric, and a novel method based on generative adversarial networks to generate these inputs. In addition, we present a comprehensive evaluation with existing metrics and state-of-the-art attribution methods. Our findings highlight the need for a more objective metric that reveals the weaknesses of an attribution method rather than that of the neural network, thus providing a more accurate evaluation of the robustness of attribution methods.

Paper Structure

This paper contains 18 sections, 6 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Distances of generated images from the original image $\bm{x}$ for: (a), (b) GAN-generated images; (c),(d) images with uniformly sampled noise; (e),(f) images with noise from $\mathcal{N}(0,0.03)$.
  • Figure 2: Distances between the prediction logit of the generated image and that of the original image.
  • Figure 3: The proposed GAN-based method for synthesizing similar inputs. The neural network $f$ is already trained and has fixed parameters during the training of the GAN.
  • Figure 4: GAN-generated MNIST images with prediction logit at most $\delta = 5.0$ away from the prediction logit of the original image (on the left).
  • Figure 5: GAN-generated CIFAR10 images for class 0 (airplane) with maximum logit at most $\delta = 5.0$ away from the maximum logit of the original image (on the left).
  • ...and 4 more figures