Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention

Saebom Leem, Hyunseok Seo

TL;DR

This paper addresses explainability for Vision Transformers (ViT) by introducing an attention-guided gradient CAM. The method uses the gradients that flow from the MLP head back along the skip connections and guides them with sigmoid-normalized self-attention maps. The final class activation map is $L^c = \sum_{k=1}^{K} \sum_{h=1}^{H} F_h^k \odot \text{ReLU}(\alpha_h^{k,c})$ with $F_h^k = G(A_{h,1}^k)$, where $k$ indexes the layers, $h$ indexes the attention heads, $\alpha_h^{k,c}$ is the gradient term for class $c$ propagated through the architecture, and $G$ normalizes the self-attention scores, with sigmoid used in place of softmax to mitigate peak intensities. The approach outperforms Attention Rollout and the LRP-based ViT explainability method on ImageNet, Pascal VOC, and CUB200 in weakly-supervised localization, with higher pixel accuracy, IoU, and Dice, and it localizes multiple instances robustly while producing faithful visual explanations. The method yields class-specific heatmaps that align well with object regions and generalize across single- and multi-object images, making ViT explanations more reliable for localization tasks. These results suggest the method is useful for applications that need faithful explanations and weakly-supervised object localization with ViT models.
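The aggregation formula in the TL;DR can be illustrated with a small NumPy sketch. It is only a sketch under stated assumptions: the function name `attention_guided_cam`, the array shapes, and the random inputs are hypothetical, and the inputs are assumed to be already sliced to one score and one gradient per image patch for each layer and head. In a real setting they would come from a forward and backward pass through a ViT, and the exact tensors the authors use may differ.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, used here as the normalization G."""
    return 1.0 / (1.0 + np.exp(-x))


def attention_guided_cam(attn_scores, grads, grid_size):
    """Combine attention-based feature maps with rectified gradients.

    attn_scores, grads: arrays of shape (K, H, N), where K is the number of
    layers, H the number of heads, and N = grid_size ** 2 the number of
    patches (assumed layout, for illustration only).
    """
    if attn_scores.shape != grads.shape:
        raise ValueError("attn_scores and grads must have the same shape")
    if attn_scores.shape[-1] != grid_size * grid_size:
        raise ValueError("last dimension must equal grid_size ** 2")

    feature_maps = sigmoid(attn_scores)                # F = G(A)
    weighted = feature_maps * np.maximum(grads, 0.0)   # F * ReLU(alpha), element-wise
    cam_flat = weighted.sum(axis=(0, 1))               # sum over layers and heads
    return cam_flat.reshape(grid_size, grid_size)      # 1-D map -> 2-D heatmap


# Hypothetical sizes: 12 layers, 12 heads, 14 x 14 patches.
rng = np.random.default_rng(0)
K, H, grid = 12, 12, 14
attn = rng.normal(size=(K, H, grid * grid))
grads = rng.normal(size=(K, H, grid * grid))

cam = attention_guided_cam(attn, grads, grid)
print(cam.shape)  # (14, 14)
```

Because the sigmoid output is positive and the ReLU output is non-negative, every entry of the resulting map is non-negative, so negative gradients never subtract from the heatmap.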

Abstract

Vision Transformer (ViT) is one of the most widely used models in the computer vision field, with strong performance on various tasks. To fully utilize ViT-based architectures in various applications, proper visualization methods with decent localization performance are necessary, but the methods employed in CNN-based models are still not available for ViT due to its unique structure. In this work, we propose an attention-guided visualization method applied to ViT that provides a high-level semantic explanation for its decision. Our method selectively aggregates the gradients directly propagated from the classification output to each self-attention, collecting the contribution of image features extracted from each location of the input image. These gradients are additionally guided by the normalized self-attention scores, which are the pairwise patch correlation scores. They are used to supplement the gradients with the patch-level context information efficiently detected by the self-attention mechanism. This approach provides elaborate high-level semantic explanations with great localization performance using only the class labels. As a result, our method outperforms the previous leading explainability methods of ViT in the weakly-supervised localization task and shows great capability in capturing the full instances of the target class object. Meanwhile, our method provides a visualization that faithfully explains the model, which is demonstrated in the perturbation comparison test.
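The weakly-supervised localization task is scored with pixel accuracy, IoU, and Dice. As a hedged illustration of how such scores are typically computed, the sketch below thresholds a heatmap into a binary mask and compares it with a ground-truth mask. The function name `localization_metrics`, the toy arrays, and the fixed threshold of 0.5 are assumptions for illustration; the source does not state how the authors binarize their heatmaps.

```python
import numpy as np


def localization_metrics(heatmap, gt_mask, threshold):
    """Return (pixel accuracy, IoU, Dice) for a thresholded heatmap."""
    pred = heatmap >= threshold          # binarize the heatmap
    gt = gt_mask.astype(bool)

    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()

    pixel_acc = (pred == gt).mean()
    # Two empty masks agree perfectly, so both overlap scores default to 1.
    iou = intersection / union if union > 0 else 1.0
    dice = 2.0 * intersection / total if total > 0 else 1.0
    return float(pixel_acc), float(iou), float(dice)


# Toy 3 x 3 example (hypothetical values).
heatmap = np.array([[0.9, 0.8, 0.1],
                    [0.7, 0.2, 0.0],
                    [0.1, 0.0, 0.0]])
gt_mask = np.array([[1, 1, 0],
                    [1, 1, 0],
                    [0, 0, 0]])

acc, iou, dice = localization_metrics(heatmap, gt_mask, threshold=0.5)
print(acc, iou, dice)
```

In this toy case the prediction covers three of the four object pixels and no background pixels, so the IoU is 3/4, the Dice score is 6/7, and the pixel accuracy is 8/9.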

Paper Structure (12 sections, 19 equations, 9 figures, 6 tables)

This paper contains 12 sections, 19 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: The illustration of peak intensity propagation from the self-attention scores to final visualization heatmaps of PASCAL VOC 2012. Raw attention is a simple sum aggregation of the self-attention scores of all layers.
  • Figure 2: The demonstration of the ViT architecture and the major components in our method. The yellow shaded lines represent the essential gradients being considered along the skip connections propagated from the classification output of the given class $c$, $y^c$. The purple-colored boxes point to the self-attention score matrices which are the result of matrix multiplication of the query and the key matrices. The feature maps are these self-attention score matrices normalized with sigmoid, which are represented as the green boxes in each block. These feature maps are aggregated with the gradients to provide the final class activation map.
  • Figure 3: The demonstration of the results of softmax and sigmoid operation applied on self-attention. The images (left) are the sum aggregation of self-attention of all layers and heads with each operation and the peaks are indicated with red boxes. The graphs (right) are the distributions of the flattened self-attention scores in the left images. The peaks are indicated with red lines.
  • Figure 4: The illustration of how the one-dimensional matrix $L^c_1$ is reshaped into a two-dimensional class activation map.
  • Figure 5: The heatmaps on the ImageNet ILSVRC 2012, Pascal VOC 2012, and CUB200 datasets generated by each of the methods. The first image in each dataset demonstrates the peak intensities generated on a homogeneous non-object background by Attention Rollout and the LRP-based method, and the reduced peak intensities in our method. The second and third images in ILSVRC 2012 and Pascal VOC 2012 show the localization performance of each method on single-instance and multiple-instance images, respectively. CUB200 consists of single-instance images only, so its second and third images each include one object instance.
  • ...and 4 more figures
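
The contrast that the Figure 3 caption describes between softmax and sigmoid can be reproduced on a toy score vector. This is a minimal sketch, assuming a single row of hypothetical attention scores with one outlier; the values and the `peak_ratio` helper are made up for illustration and are not taken from the paper.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - x.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def peak_ratio(values):
    """How far the largest value stands above the median value."""
    return float(values.max() / np.median(values))


# Hypothetical raw scores: four similar patches and one outlier.
scores = np.array([1.0, 1.2, 0.8, 1.1, 8.0])

softmax_ratio = peak_ratio(softmax(scores))
sigmoid_ratio = peak_ratio(sigmoid(scores))
print(softmax_ratio, sigmoid_ratio)
```

Softmax normalizes across the row, so the outlier takes almost all of the mass and dominates the median entry by several hundred times. Sigmoid bounds each score independently in (0, 1), so the same outlier ends up only slightly above the other entries, which is the peak-suppressing behavior the caption refers to.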