On the Robustness of Interpretability Methods

David Alvarez-Melis; Tommi S. Jaakkola

TL;DR

The paper addresses the reliability of interpretability methods by formalizing local robustness through Lipschitz continuity and introducing metrics that quantify how much an explanation changes under small perturbations of the input. It compares gradient-based and perturbation-based explanation methods (e.g., LIME, SHAP, saliency methods) across several datasets (UCI, COMPAS, MNIST, ImageNet) and finds widespread instability, especially for model-agnostic approaches. Explanations can change substantially even when the model's predictions remain stable. The paper proposes ways to enforce robustness, including robust explanation training and techniques inspired by adversarial training, and offers guidance for future robustness evaluation. Overall, it provides a framework for measuring and improving the reliability of explanations in practical settings.

Abstract

We argue that robustness of explanations (i.e., that similar inputs should give rise to similar explanations) is a key desideratum for interpretability. We introduce metrics to quantify robustness and demonstrate that current methods do not perform well according to these metrics. Finally, we propose ways that robustness can be enforced on existing interpretability approaches.
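
To make the idea of a robustness metric concrete, here is a minimal sketch of a local Lipschitz estimate for an explanation function: the largest observed ratio between the change in the explanation and the change in the input within a small neighborhood. This is an illustration under stated assumptions, not the authors' implementation. The names `local_lipschitz_estimate`, `gradient_explanation`, and `WEIGHTS`, the toy quadratic model, the L-infinity sampling box, and the use of plain random sampling (which only yields a lower bound on the true maximum) are all hypothetical choices made for this example.

```python
import math
import random


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def local_lipschitz_estimate(explain, x, eps=0.1, n_samples=200, seed=0):
    """Sampling-based lower bound on
    max over x' near x of ||explain(x) - explain(x')|| / ||x - x'||.

    `explain` maps an input (list of floats) to an attribution vector.
    Random search is an assumption of this sketch; any optimizer could
    be substituted to search the neighborhood more thoroughly.
    """
    rng = random.Random(seed)
    e_x = explain(x)
    worst = 0.0
    for _ in range(n_samples):
        # Draw a neighbor uniformly from the L-infinity box of radius eps.
        x_p = [xi + rng.uniform(-eps, eps) for xi in x]
        dist = l2_norm([a - b for a, b in zip(x, x_p)])
        if dist == 0.0:
            continue  # skip degenerate draws to avoid division by zero
        change = l2_norm([a - b for a, b in zip(e_x, explain(x_p))])
        worst = max(worst, change / dist)
    return worst


# Hypothetical toy model f(x) = sum_i w_i * x_i**2, explained by its gradient.
WEIGHTS = [0.5, -1.0, 2.0]


def gradient_explanation(x):
    """Gradient of the toy model: d f / d x_i = 2 * w_i * x_i."""
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


estimate = local_lipschitz_estimate(gradient_explanation, [1.0, 0.0, -1.0])
print(f"estimated local Lipschitz constant: {estimate:.3f}")
```

For this toy explanation the ratio is bounded analytically between 2 * min|w_i| = 1 and 2 * max|w_i| = 4, which gives a quick sanity check on the estimate. A larger value indicates that nearby inputs can receive very different explanations, which is the kind of instability the abstract refers to.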
