Grad-CAM: Why did you say that?

Ramprasaath R Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, Dhruv Batra

TL;DR

The paper tackles the interpretability gap in CNN-based vision systems by introducing Grad-CAM, a gradient-based localization method that yields class-discriminative heatmaps without changing network architectures. It generalizes Class Activation Mapping (CAM) by using gradients to weight convolutional feature maps and combines these with guided backpropagation to produce Guided Grad-CAM, a high-resolution visualization. The authors validate explanations across image captioning and visual question answering, plus human studies showing improved discriminability and trust, and assess faithfulness via occlusion comparisons. They provide released code and demos to facilitate adoption and further research in model understanding.

Abstract

We propose a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing input regions that are 'important' for predictions -- or visual explanations. Our approach, called Gradient-weighted Class Activation Mapping (Grad-CAM), uses class-specific gradient information to localize important regions. These localizations are combined with existing pixel-space visualizations to create a novel high-resolution and class-discriminative visualization called Guided Grad-CAM. These methods help better understand CNN-based models, including image captioning and visual question answering (VQA) models. We evaluate our visual explanations by measuring their ability to discriminate between classes, to inspire trust in humans, and their correlation with occlusion maps. Grad-CAM provides a new way to understand CNN-based models. We have released code, an online demo hosted on CloudCV, and a full version of this extended abstract.
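For readers who want the gradient weighting spelled out, Grad-CAM is commonly written as below. The notation is assumed here, since this page does not reproduce the paper's equations: $A^k$ is the $k$-th feature map of the chosen convolutional layer, $y^c$ is the score for class $c$ before the softmax, and $Z$ is the number of spatial positions in a feature map.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^k},
\qquad
L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c A^k \right)
```

The weight $\alpha_k^c$ is the global average of the gradient over one feature map, and the ReLU keeps only the features with a positive influence on class $c$.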

Paper Structure

This paper contains 7 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Grad-CAM overview: Given an image and a category ('tiger cat') as input, we forward propagate the image through the model to obtain the raw class scores before softmax. The gradients are set to zero for all classes except the desired class (tiger cat), which is set to 1. This signal is then backpropagated to the rectified convolutional feature map of interest, where we can compute the coarse Grad-CAM localization (blue heatmap). Finally, we pointwise multiply the heatmap with guided backpropagation to get Guided Grad-CAM visualizations which are both high-resolution and class-discriminative.
  • Figure 2: In these cases the model (VGG-16) failed to predict the correct class as its top-1 prediction, and in (b) it failed to predict the correct class even in its top 5. All of these errors are due in part to class ambiguity. (a) For example, the network predicts 'sandbar' based on the foreground, but it also knows where the correct label 'volcano' is located. (c-d) In other cases, the errors are still reasonable but not immediately apparent. For example, humans would find it hard to explain the predicted 'syringes' in (c) without looking at the visualization for the predicted class.
  • Figure 3: Interpreting image captioning models: We use our class-discriminative localization technique, Grad-CAM, to find spatial support regions for captions in images. Captioning panel: Visual explanations from an image captioning model [karpathy2015deep], highlighting image regions considered to be important for producing the captions. DenseCap panel: Grad-CAM localizations of a global or holistic captioning model for captions generated by a dense captioning model [johnson_cvpr16] for the three bounding box proposals marked on the left. We can see that we get back Grad-CAM localizations (right) that agree with those bounding boxes -- even though the captioning model and Grad-CAM do not use any bounding box annotations.
  • Figure 4: VQA visualizations: (a) Given the image on the left and the question "What color is the firehydrant?", we visualize Grad-CAMs and Guided Grad-CAMs for the answers "red", "yellow" and "yellow and red". Our visualizations are highly interpretable and help explain the model's predictions -- for "red", the model focuses on the bottom red part of the fire hydrant; when forced to answer "yellow", the model concentrates on its top yellow cap, and when forced to answer "yellow and red", it looks at the whole fire hydrant! (b) Grad-CAM for ResNet-based VQA model.
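
The pipeline in the Figure 1 caption can be sketched in a few lines. This is a minimal, dependency-free illustration, not the authors' released code. It assumes the feature maps and the backpropagated gradients have already been extracted from a network and are passed in as nested lists. The function names (`grad_cam`, `upsample_nearest`, `guided_grad_cam`) and the toy numbers are hypothetical, and nearest-neighbour upsampling is swapped in as a simplification of whatever interpolation a real implementation would use.

```python
def grad_cam(activations, gradients):
    """Coarse Grad-CAM map from K feature maps and their gradients.

    activations[k][i][j]: rectified feature map k of the chosen conv layer.
    gradients[k][i][j]:   gradient of the target class score w.r.t. that entry.
    """
    num_maps = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    # One weight per feature map: the spatial average of its gradient.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # Weighted sum over feature maps, then ReLU to keep positive evidence only.
    return [
        [max(0.0, sum(weights[k] * activations[k][i][j] for k in range(num_maps)))
         for j in range(width)]
        for i in range(height)
    ]


def upsample_nearest(cam, out_h, out_w):
    """Nearest-neighbour upsampling (a simplification chosen for this sketch)."""
    h, w = len(cam), len(cam[0])
    return [[cam[i * h // out_h][j * w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def guided_grad_cam(cam, guided_backprop):
    """Pointwise product of the upsampled heatmap and a guided-backprop map."""
    out_h, out_w = len(guided_backprop), len(guided_backprop[0])
    big = upsample_nearest(cam, out_h, out_w)
    return [[big[i][j] * guided_backprop[i][j] for j in range(out_w)]
            for i in range(out_h)]


# Toy inputs (made up): two 2x2 feature maps and their gradients.
toy_activations = [[[1.0, 0.0], [0.0, 2.0]],
                   [[0.0, 3.0], [1.0, 0.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]],      # map 0 supports the class
                 [[-1.0, -1.0], [-1.0, -1.0]]]  # map 1 counts against it
toy_guided = [[0.5] * 4 for _ in range(4)]       # stand-in 4x4 pixel-space map

toy_cam = grad_cam(toy_activations, toy_gradients)
toy_result = guided_grad_cam(toy_cam, toy_guided)
```

In the toy case the second feature map receives a negative weight, so the locations where only it fires are zeroed by the ReLU. This is the mechanism that makes the coarse heatmap class-discriminative.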