Interpretable Explanations of Black Boxes by Meaningful Perturbation

Ruth Fong, Andrea Vedaldi

TL;DR

The paper addresses the need for interpretable explanations of black-box predictors by treating explanations as meta-predictors that can be learned from data. It develops a model-agnostic, perturbation-based framework in which an explanation is a program that predicts the responses of the black box f, and its faithfulness is measured by its prediction error. The framework is then specialized to image saliency using meaningful image perturbations: deletion and preservation games, solved by iterated gradient steps with regularizers that mitigate artifacts. Empirically, the method finds minimal, interpretable deletions that localize decision-relevant regions and reveal non-obvious correlations; the paper also reports insights for adversarial defense and applies the masks to object localization.

Abstract

As machine learning algorithms are increasingly applied to high impact yet high risk tasks, such as medical diagnosis or autonomous driving, it is critical that researchers can explain how such algorithms arrived at their predictions. In recent years, a number of image saliency methods have been developed to summarize where highly complex neural networks "look" in an image for evidence for their predictions. However, these techniques are limited by their heuristic nature and architectural constraints. In this paper, we make two main contributions: First, we propose a general framework for learning different kinds of explanations for any black box algorithm. Second, we specialise the framework to find the part of an image most responsible for a classifier decision. Unlike previous works, our method is model-agnostic and testable because it is grounded in explicit and interpretable image perturbations.

Paper Structure

This paper contains 22 sections, 5 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: An example of a mask learned (right) by blurring an image (middle) to suppress the softmax probability of its target class (left: original image; softmax scores above images).
  • Figure 2: Comparison with other saliency methods. From left to right: original image with ground truth bounding box, learned mask subtracted from 1 (our method), gradient-based saliency simonyan14deep, guided backprop springenberg2014striving, mahendran2016salient, contrastive excitation backprop zhang2016top, Grad-CAM selvaraju2016grad, and occlusion zeiler2014visualizing.
  • Figure 3: Gradient saliency maps of simonyan14deep. A red bounding box highlights the object that is meant to be recognized in the image. Note the strong response in apparently non-relevant image regions.
  • Figure 4: Perturbation types. Bottom: perturbation mask; top: effect of blur, constant, and noise perturbations.
  • Figure 5: From left to right: an image correctly classified with high confidence by GoogLeNet szegedy2015going; a perturbed image that is no longer recognized correctly; the deletion mask learned with artifacts. Top: A mask learned by minimizing the top five predicted classes while jointly applying the constant, random noise, and blur perturbations. Note that the mask learns to add highly structured swirls along the rim of the cup ($\gamma = 1, \lambda_1 = 10^{-5}, \lambda_2 = 10^{-3}, \beta = 3$). Bottom: A minimizing-top5 mask learned by applying a constant perturbation. Notice that the mask learns to introduce sharp, unnatural artifacts in the sky instead of deleting the pole ($\gamma = 0.1, \lambda_1 = 10^{-4}, \lambda_2 = 10^{-2}, \beta = 3$).
  • ...and 5 more figures
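
The three perturbation types named in the Figure 4 caption (blur, constant, noise) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a grayscale image stored as nested lists, a mask convention in which 1 keeps a pixel and 0 fully perturbs it (suggested by "learned mask subtracted from 1" in Figure 2), and a fixed-radius box blur standing in for whatever blur the paper uses. All function names and parameters here are hypothetical.

```python
import random


def blend(image, mask, reference):
    """Per pixel: mask * original + (1 - mask) * reference.

    Assumed convention: mask == 1 keeps the pixel, mask == 0 replaces it.
    """
    return [
        [m * x + (1.0 - m) * r for x, m, r in zip(img_row, mask_row, ref_row)]
        for img_row, mask_row, ref_row in zip(image, mask, reference)
    ]


def constant_reference(image, value=0.0):
    """Reference image filled with one constant value (hypothetical default)."""
    return [[value for _ in row] for row in image]


def noise_reference(image, sigma=0.1, seed=0):
    """Reference image of independent Gaussian noise samples."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in row] for row in image]


def blur_reference(image, radius=1):
    """Box-blurred copy of the image (a simple stand-in for a smoother blur)."""
    height, width = len(image), len(image[0])
    blurred = []
    for i in range(height):
        row = []
        for j in range(width):
            # Average over the window, clipped at the image border.
            window = [
                image[a][b]
                for a in range(max(0, i - radius), min(height, i + radius + 1))
                for b in range(max(0, j - radius), min(width, j + radius + 1))
            ]
            row.append(sum(window) / len(window))
        blurred.append(row)
    return blurred


# Toy 3x3 image with one bright pixel; the mask deletes only that pixel.
image = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
mask = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]

deleted_constant = blend(image, mask, constant_reference(image, value=0.5))
deleted_noise = blend(image, mask, noise_reference(image))
deleted_blur = blend(image, mask, blur_reference(image))
```

In a deletion game, the mask would be the variable being optimized, and the black box would be evaluated on the blended image; the toy mask above is fixed by hand only to show the effect of each reference image.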