On the Robustness of Interpretability Methods

David Alvarez-Melis, Tommi S. Jaakkola

TL;DR

This paper examines the reliability of interpretability methods. It formalizes local robustness through a notion of Lipschitz continuity and introduces metrics that quantify how much an explanation changes when the input is perturbed within a small neighborhood. It compares gradient-based and perturbation-based explanation methods (e.g., LIME, SHAP, saliency methods) across several datasets (UCI, Compas, MNIST, ImageNet) and finds widespread instability, especially for model-agnostic approaches. Explanations can vary substantially even when the model's predictions remain stable. The paper also proposes ways to enforce robustness, including robust explanation training and techniques inspired by adversarial training, and offers guidance for evaluating robustness in future work.

Abstract

We argue that robustness of explanations---i.e., that similar inputs should give rise to similar explanations---is a key desideratum for interpretability. We introduce metrics to quantify robustness and demonstrate that current methods do not perform well according to these metrics. Finally, we propose ways that robustness can be enforced on existing interpretability approaches.

Paper Structure

This paper contains 8 sections, 3 equations, 8 figures.

Figures (8)

  • Figure 1: Lime and Shap explanations for two simple binary classifiers: a linear SVM (top row) and a two-layer neural network (bottom row). The heatmaps depict the level sets of each model's positive-class probability, and the bar-chart insets show the interpreters' explanations (attribution values for $x$ in green and $y$ in purple) for test point predictions. While both Lime and Shap's explanations for the linear model are stable, for the non-linear model (bottom) they vary significantly within small neighborhoods.
  • Figure 2: Local Lipschitz estimates (Eq. \ref{eq:eval_metric_continuous}) computed on 100 test points on various UCI classification datasets.
  • Figure 3: Top: example $x_i$ from the Boston dataset and its explanations (attributions). Bottom: explanations for the maximizer of the Lipschitz estimate $L(x_i)$ as per Eq. \ref{eq:eval_metric_continuous}.
  • Figure 4: Robustness upon explaining a classifier on the Compas dataset. The two rows correspond to the pair maximizing $\tilde{L}_X$ (Eq. \ref{eq:eval_metric_discrete}) over the entire test fold, with $\epsilon=0.1$.
  • Figure 5: Explanations of a CNN model's prediction on an example Mnist digit (top row) and three versions with Gaussian noise added to it. The perturbed input digits are labeled with the probability assigned to the predicted class by the classifier. Here $\delta$ is the ratio $\| f(x) - f(x')\|_2/\| x- x'\|_2$ for the perturbed $x'$, which are not adversarially chosen as in Eq. \ref{eq:eval_metric_continuous}.
  • ...and 3 more figures
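
The ratio in the Figure 5 caption, and the idea in Figure 4 of maximizing it over pairs of samples within distance $\epsilon$, can be illustrated with a short sketch. This is not the authors' implementation, and the equations themselves are not reproduced in this summary. The function names `explanation_ratio` and `discrete_lipschitz_estimate`, the use of the L2 norm to define the neighborhood, and the toy explainer are all assumptions made for illustration. Here `f` stands for any function that maps an input vector to an attribution vector.

```python
import numpy as np


def explanation_ratio(f, x, x_prime):
    """Return ||f(x) - f(x')||_2 / ||x - x'||_2 for an explanation function f."""
    x = np.asarray(x, dtype=float)
    x_prime = np.asarray(x_prime, dtype=float)
    denom = np.linalg.norm(x - x_prime)
    if denom == 0.0:
        raise ValueError("x and x_prime must differ")
    diff = np.asarray(f(x), dtype=float) - np.asarray(f(x_prime), dtype=float)
    return float(np.linalg.norm(diff) / denom)


def discrete_lipschitz_estimate(f, X, i, eps):
    """Largest ratio between X[i] and any other sample within L2 distance eps.

    Returns (value, index of the maximizing sample), or (0.0, None) when
    X[i] has no neighbor inside the eps-ball. The L2 neighborhood is an
    assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X - X[i], axis=1)
    best, best_j = 0.0, None
    # Candidate neighbors: inside the ball, excluding exact copies of X[i].
    for j in np.flatnonzero((dists <= eps) & (dists > 0)):
        ratio = explanation_ratio(f, X[i], X[j])
        if ratio > best:
            best, best_j = ratio, int(j)
    return best, best_j


def toy_explainer(x):
    """Hypothetical explainer: the gradient of ||x||^2, which is 2x."""
    return 2.0 * np.asarray(x, dtype=float)


if __name__ == "__main__":
    samples = np.array([[0.0, 0.0], [0.05, 0.0], [1.0, 1.0]])
    value, neighbor = discrete_lipschitz_estimate(toy_explainer, samples, 0, eps=0.1)
    print(value, neighbor)
```

Because the toy explainer is linear, every pair of points yields a ratio close to 2, so the estimate is the same wherever it is evaluated. A less robust explainer would show large values for some points, which is the behavior the figures above report for the non-linear models.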

Theorems & Definitions (1)

  • Definition 2.1