Interpretation of Neural Networks is Fragile

Amirata Ghorbani, Abubakar Abid, James Zou

TL;DR

The paper demonstrates that neural network explanations can be highly fragile: small, imperceptible input perturbations can substantially alter feature-importance and exemplar-based interpretations without changing the predicted label. It formalizes adversarial perturbations against interpretations, develops iterative and gradient-based attacks, and evaluates them on ImageNet and CIFAR-10 across saliency methods like gradients, DeepLIFT, and integrated gradients as well as influence functions. The Hessian-based analysis provides intuition for why high-dimensional, nonlinear models produce unstable explanations. The work highlights serious considerations for trusting and deploying interpretable AI, and discusses potential defenses to improve robustness of explanations. It emphasizes that robust interpretability is a distinct and important objective alongside predictive accuracy.

Abstract

In order for machine learning to be deployed and trusted in many applications, it is crucial to be able to reliably explain why the machine learning algorithm makes certain predictions. For example, if an algorithm classifies a given pathology image to be a malignant tumor, then the doctor may need to know which parts of the image led the algorithm to this classification. How to interpret black-box predictors is thus an important and active area of research. A fundamental question is: how much can we trust the interpretation itself? In this paper, we show that interpretation of deep learning predictions is extremely fragile in the following sense: two perceptively indistinguishable inputs with the same predicted label can be assigned very different interpretations. We systematically characterize the fragility of several widely-used feature-importance interpretation methods (saliency maps, relevance propagation, and DeepLIFT) on ImageNet and CIFAR-10. Our experiments show that even small random perturbation can change the feature importance and new systematic perturbations can lead to dramatically different interpretations without changing the label. We extend these results to show that interpretations based on exemplars (e.g. influence functions) are similarly fragile. Our analysis of the geometry of the Hessian matrix gives insight on why fragility could be a fundamental challenge to the current interpretation approaches.

Paper Structure

This paper contains 29 sections, 17 equations, 14 figures, 1 algorithm.

Figures (14)

  • Figure 1: Adversarial attack against feature-importance maps. We generate feature-importance scores, also called saliency maps, using three popular interpretation methods: (a) simple gradients, (b) DeepLIFT, and (c) integrated gradients. The top row shows the original images and their saliency maps, and the bottom row shows the perturbed images (using the center attack with $\epsilon=8$, as described in Section \ref{sec:attack}) and the corresponding saliency maps. In all three images, the predicted label does not change under the perturbation; however, the saliency maps of the perturbed images shift dramatically to features that would not be considered salient by human perception.
  • Figure 2: Intuition for why interpretation is fragile. Consider a test example $\boldsymbol{x_t} \in \mathbb{R}^2$ (solid black circle) that is slightly perturbed to a new position $\boldsymbol{x_t}+\boldsymbol{\delta}$ in input space (dashed black dot). The contours and decision boundary corresponding to a loss function ($L$) for a two-class classification task are also shown, allowing one to see the direction of the gradient of the loss with respect to the input space. Neural networks with many parameters have decision boundaries that are roughly piecewise linear with many transitions \cite{goodfellow2014explaining}. We illustrate that points near the transitions are especially fragile to interpretability-based analysis. A small perturbation to the input changes the direction of $\nabla_x L$ from being in the horizontal direction to being in the vertical direction, directly affecting feature-importance analyses. Similarly, a small perturbation to the test image changes which data point (before perturbation: solid blue, after perturbation: dashed blue), when up-weighted, has the largest influence on $L$, directly affecting exemplar-based analysis.
  • Figure 3: Comparison of adversarial attack algorithms on feature-importance methods. Across 512 correctly classified ImageNet images, the top-$k$ and center attacks perform similarly and are far more effective than the random sign perturbation at demonstrating the fragility of interpretation, as measured by top-1000 intersection (top) and rank-order correlation (bottom).
  • Figure 4: Gradient sign attack on influence functions. (a) An imperceptible perturbation to a test image can significantly affect sample-importance interpretability. The original test image is that of a sunflower that is classified correctly in a rose vs. sunflower classification task. The top 3 training images identified by influence functions are shown in the top row. Using the gradient sign attack, we perturb the test image (with $\epsilon=8$) to produce the leftmost image in the bottom row. Although the image is even more confidently predicted as a sunflower, influence functions suggest very different training images by means of explanation: instead of the sunflowers and yellow petals that resemble the input image, the most influential images are pink/red roses. (b) Average results for applying random (green) and gradient sign-based (orange) perturbations to 200 test images are shown. Random attacks have a gentle effect on interpretability, while a gradient perturbation can significantly affect the rank correlation and (c) the 5 most influential images. The plot on the right shows the influence of each training image before and after perturbation. The 3 most influential images (targeted by the attack) have decreased in influence, but the influences of other images have also changed.
  • Figure 5: Targeted attack against feature-importance map. The image is correctly classified as “trailer truck”. For all methods, the devised perturbation with $\epsilon=8$ was able to shift the focus of the saliency map, in a semantically meaningful way, to the “cloud” above the truck. (The cloud area was captured using SLIC \cite{achanta2012slic} superpixel segmentation.)
  • ...and 9 more figures
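
The two measures named in the Figure 3 caption, top-$k$ intersection and rank-order correlation, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the handling of ties (they are ignored), and the `random_sign_perturb` helper (each pixel shifted by $\pm\epsilon$ on an assumed 0 to 255 pixel scale) are all hypothetical choices made for the example.

```python
import numpy as np


def top_k_intersection(map_a, map_b, k):
    """Fraction of the k highest-scoring features that both maps share."""
    a = np.asarray(map_a, dtype=float).ravel()
    b = np.asarray(map_b, dtype=float).ravel()
    # argpartition on the negated scores puts the k largest scores first.
    top_a = set(np.argpartition(-a, k - 1)[:k].tolist())
    top_b = set(np.argpartition(-b, k - 1)[:k].tolist())
    return len(top_a & top_b) / k


def rank_correlation(map_a, map_b):
    """Spearman rank correlation of two maps (assumes no tied scores)."""
    a = np.asarray(map_a, dtype=float).ravel()
    b = np.asarray(map_b, dtype=float).ravel()
    # Double argsort converts scores to ranks 0..n-1.
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    # Spearman correlation is the Pearson correlation of the ranks.
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


def random_sign_perturb(image, eps, rng):
    """Baseline perturbation: shift every pixel by +eps or -eps at random.

    Assumes pixel values on a 0..255 scale (an assumption of this sketch).
    """
    image = np.asarray(image, dtype=float)
    signs = rng.choice([-1.0, 1.0], size=image.shape)
    return np.clip(image + eps * signs, 0.0, 255.0)
```

Given saliency maps `s_orig` and `s_pert` computed by any interpretation method on an image and its perturbed copy, `top_k_intersection(s_orig, s_pert, 1000)` and `rank_correlation(s_orig, s_pert)` give the two quantities plotted in Figure 3; lower values indicate a more fragile interpretation.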