RISE: Randomized Input Sampling for Explanation of Black-box Models

Vitali Petsiuk, Abir Das, Kate Saenko

TL;DR

RISE introduces a true black-box approach to explain image classifiers by estimating pixel saliency through randomized input masking and Monte Carlo weighting of masks. It avoids internal model access, outperforming several white-box and perturbation-based methods on automatic causal metrics (deletion/insertion) and showing competitive human-centric evaluations. The method extends naturally to captioning, demonstrating versatility across vision tasks. This work provides practical, architecture-agnostic explanations with quantified causal interpretation, at the cost of higher computational requirements.

Abstract

Deep neural networks are being used increasingly to automate data analysis and decision making, yet their decision-making process is largely unclear and is difficult to explain to the end users. In this paper, we address the problem of Explainable AI for deep neural networks that take images as input and output a class probability. We propose an approach called RISE that generates an importance map indicating how salient each pixel is for the model's prediction. In contrast to white-box approaches that estimate pixel importance using gradients or other internal network state, RISE works on black-box models. It estimates importance empirically by probing the model with randomly masked versions of the input image and obtaining the corresponding outputs. We compare our approach to state-of-the-art importance extraction methods using both an automatic deletion/insertion metric and a pointing metric based on human-annotated object segments. Extensive experiments on several benchmark datasets show that our approach matches or exceeds the performance of other methods, including white-box approaches. Project page: http://cs-people.bu.edu/vpetsiuk/rise/
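
The probing procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`generate_masks`, `rise_saliency`, `toy_model`), the blocky nearest-neighbour masks, the parameter defaults, and the normalization by `n_masks * p_keep` are all assumptions made here for the example. The only requirement on `model` is that it maps an image to a scalar score for the class of interest, which is what makes the approach black-box.

```python
import numpy as np


def generate_masks(n_masks, grid, image_hw, p_keep, rng):
    """Sample random binary grids and upsample them to the image size.

    Assumption: plain nearest-neighbour (blocky) upsampling, chosen for brevity.
    """
    h, w = image_hw
    cell_h = -(-h // grid)  # ceiling division
    cell_w = -(-w // grid)
    grids = (rng.random((n_masks, grid, grid)) < p_keep).astype(np.float64)
    masks = grids.repeat(cell_h, axis=1).repeat(cell_w, axis=2)
    return masks[:, :h, :w]


def rise_saliency(model, image, n_masks=500, grid=4, p_keep=0.5, seed=0):
    """Weight each random mask by the model's score on the masked image."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    masks = generate_masks(n_masks, grid, (h, w), p_keep, rng)
    saliency = np.zeros((h, w))
    for mask in masks:
        # Broadcast the 2-D mask over channels if the image has them.
        masked = image * mask[..., None] if image.ndim == 3 else image * mask
        saliency += model(masked) * mask
    # Assumed normalization: number of masks times expected mask value.
    return saliency / (n_masks * p_keep)


# Hypothetical stand-in for a classifier: the "class score" is the mean
# brightness of one fixed 8x8 patch, so only that patch should matter.
def toy_model(x):
    return float(x[8:16, 8:16].mean())


image = np.ones((32, 32))
sal = rise_saliency(toy_model, image)
```

With this toy model, the map assigns higher importance to the patch the score depends on than to the rest of the image. A real classifier would replace `toy_model`, and many more masks are typically needed, which is where the computational cost mentioned in the TL;DR comes from.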

Paper Structure

This paper contains 12 sections, 7 equations, 9 figures, 2 tables, 2 algorithms.

Figures (9)

  • Figure 1: Our proposed RISE approach can explain why a black-box model (here, ResNet50) makes classification decisions by generating a pixel importance map for each decision (redder is more important). For the top image, it reveals that the model only recognizes the white sheep and confuses the black one with a cow; for the bottom image it confuses parts of birds with a person. (Images taken from the PASCAL VOC dataset.)
  • Figure 2: Estimation of importance of each pixel by RISE and other state-of-the-art methods for a base model's prediction along with 'deletion' scores (AUC). The top row shows an input image (from ImageNet) and saliency maps produced by RISE, Grad-CAM (Selvaraju et al., 2017) and LIME (Ribeiro et al., 2016) with ResNet50 as the base network (redder values indicate higher importance). The bottom row illustrates the deletion metric: salient pixels are gradually masked from the image (see the 'RISEmask' panel) in order of decreasing importance, and the probability of the 'goldfish' class predicted by the network is plotted vs. the fraction of removed pixels. In this example, RISE provides more accurate saliency and achieves the lowest AUC.
  • Figure 3: Overview of RISE: Input image $I$ is element-wise multiplied with random masks $M_i$ and the masked images are fed to the base model. The saliency map is a linear combination of the masks where the weights come from the score of the target class corresponding to the respective masked inputs.
  • Figure 4: RISE-generated importance maps (second column) for two representative images (first column) with deletion (third column) and insertion (fourth column) curves.
  • Figure 5: Explanations of image captioning models. The 'Original' panel is the image with the caption generated by the model of Donahue et al. (2015). The 'Horse' and 'Carriage' panels show the importance maps generated by RISE for the two words 'horse' and 'carriage', respectively, from the generated caption. The 'White' panel shows the importance map for an arbitrary word, 'white'.
  • ...and 4 more figures
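
The deletion metric described in the Figure 2 caption can be sketched as below. This is an illustrative approximation under stated assumptions, not the paper's evaluation code: the names `deletion_curve` and `auc`, the number of steps, the zero fill value, and the toy model are all hypothetical choices made for this example. Pixels are removed in order of decreasing saliency, the model's score is recorded after each step, and a lower area under the resulting curve indicates a better explanation.

```python
import numpy as np


def deletion_curve(model, image, saliency, n_steps=16, fill=0.0):
    """Scores of `model` as the most salient pixels are progressively removed."""
    h, w = saliency.shape
    # Flat pixel indices, most important first.
    order = np.argsort(saliency, axis=None)[::-1]
    current = image.astype(np.float64).copy()
    scores = [model(current)]
    per_step = -(-order.size // n_steps)  # ceiling division
    for start in range(0, order.size, per_step):
        rows, cols = np.unravel_index(order[start:start + per_step], (h, w))
        current[rows, cols] = fill  # also clears all channels of an H x W x C image
        scores.append(model(current))
    return np.asarray(scores)


def auc(scores):
    """Trapezoidal area under the curve, with the x-axis rescaled to [0, 1].

    Assumes equally sized steps, which holds when n_steps divides the pixel count.
    """
    x = np.linspace(0.0, 1.0, len(scores))
    return float(np.sum((scores[1:] + scores[:-1]) / 2.0 * np.diff(x)))


# Hypothetical stand-in for a classifier: score is the mean of one 8x8 patch.
def toy_model(x):
    return float(x[8:16, 8:16].mean())


image = np.ones((32, 32))

good_map = np.zeros((32, 32))
good_map[8:16, 8:16] = 1.0      # highlights the patch the model relies on
bad_map = 1.0 - good_map        # highlights everything except that patch

auc_good = auc(deletion_curve(toy_model, image, good_map))
auc_bad = auc(deletion_curve(toy_model, image, bad_map))
```

In this toy setting, the map that points at the relevant patch makes the score collapse after the first deletion step, so `auc_good` comes out lower than `auc_bad`. The insertion metric mentioned in the abstract runs the same idea in reverse (pixels are added rather than removed), where a higher AUC is better.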