On the Robustness of Interpretability Methods
David Alvarez-Melis, Tommi S. Jaakkola
TL;DR
The paper addresses the reliability of interpretability methods by formalizing local robustness through Lipschitz continuity and introducing metrics that quantify how much an explanation changes under small perturbations of the input. It compares gradient-based and perturbation-based explanation methods (e.g., LIME, SHAP, saliency methods) across several datasets (UCI, COMPAS, MNIST, ImageNet) and finds widespread instability, especially for model-agnostic approaches. Explanations can change substantially even when the model's predictions remain stable. The paper also proposes strategies for enforcing robustness, including robust explanation training and adversarial-training-inspired techniques, and offers guidance for future robustness evaluation. Overall, it provides a framework for measuring and improving the reliability of explanations in practical settings.
Abstract
We argue that robustness of explanations (i.e., that similar inputs should give rise to similar explanations) is a key desideratum for interpretability. We introduce metrics to quantify robustness and demonstrate that current methods do not perform well according to these metrics. Finally, we propose ways that robustness can be enforced on existing interpretability approaches.
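
To make "similar inputs should give rise to similar explanations" concrete, one plausible metric is a local Lipschitz estimate: the largest ratio of the change in the explanation to the change in the input over a small neighborhood of a point. The block below is a minimal sketch under stated assumptions, not the authors' implementation. The function name `local_lipschitz_estimate`, the L-infinity neighborhood, the use of random sampling in place of exact maximization, and the toy explainer are all illustrative choices.

```python
import numpy as np


def local_lipschitz_estimate(explain, x, eps=0.1, n_samples=1000, seed=0):
    """Estimate max ||explain(x) - explain(x')||_2 / ||x - x'||_2 over
    points x' in an L-infinity ball of radius eps around x.

    Random sampling only yields a lower bound on the true maximum;
    an optimizer could be substituted for a tighter estimate.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    e_x = np.asarray(explain(x), dtype=float)

    best = 0.0
    for _ in range(n_samples):
        # Draw a perturbation uniformly from the eps-box around x.
        delta = rng.uniform(-eps, eps, size=x.shape)
        dist = np.linalg.norm(delta)
        if dist == 0.0:
            continue  # skip the degenerate zero perturbation
        e_xp = np.asarray(explain(x + delta), dtype=float)
        best = max(best, np.linalg.norm(e_xp - e_x) / dist)
    return best


# Hypothetical toy explainer: for f(x) = sum(x**2), the gradient
# attribution is 2x, whose Lipschitz constant is exactly 2.
def gradient_explanation(x):
    return 2.0 * x


# Hypothetical explainer that ignores its input entirely.
def constant_explanation(x):
    return np.ones_like(x)


x0 = np.array([1.0, -0.5, 0.3])
estimate = local_lipschitz_estimate(gradient_explanation, x0)
constant_estimate = local_lipschitz_estimate(constant_explanation, x0)
```

A larger estimate means the explanation is less stable around `x0`. In this sketch the constant explainer scores zero, which shows that a robustness score alone does not say whether an explanation is faithful to the model.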
