Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention
Saebom Leem, Hyunseok Seo
TL;DR
This paper tackles explainability for Vision Transformers (ViT) by introducing an attention-guided gradient CAM that uses gradients propagated from the MLP head through the skip connections and takes sigmoid-normalized self-attention maps as guides. The final class activation map is $L^c = \sum_{k=1}^{K} \sum_{h=1}^{H} F_h^k \odot \text{ReLU}(\alpha_h^{k,c})$ with $F_h^k = G(A_{h,1}^k)$, where $k$ indexes the layers, $h$ indexes the attention heads, $\alpha_h^{k,c}$ is the gradient term for class $c$ propagated through the architecture, and $G$ is the normalization applied to the self-attention map, in which softmax is replaced by sigmoid to mitigate peak intensities. The approach outperforms Attention Rollout and LRP-based ViT explainability on ImageNet, Pascal VOC, and CUB200 in weakly-supervised localization, achieving higher pixel accuracy, IoU, and Dice, and it shows robust multi-instance localization and faithful visual explanations. The method yields class-specific heatmaps that align well with object regions and generalize across single- and multi-object images, making ViT explanations more reliable for localization tasks. These results suggest practical value for applications that require faithful, end-to-end explanations and weakly-supervised object localization with ViT models.
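The aggregation in the formula for $L^c$ can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it assumes that $A_{h,1}^k$ is the class-token row of the raw attention scores restricted to the image patches, that $G$ is an element-wise sigmoid, and that the gradient terms $\alpha_h^{k,c}$ are already available as plain numbers. The function name `attention_guided_cam` and the toy values are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Element-wise normalization G (assumed here to be a plain sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))


def attention_guided_cam(attn_scores, grads):
    """Sum F_h^k * ReLU(alpha_h^{k,c}) over all layers k and heads h.

    attn_scores[k][h][n]: assumed raw class-token attention score for patch n.
    grads[k][h][n]:       assumed gradient term alpha for patch n and class c.
    Returns a list with one heatmap value per patch.
    """
    num_patches = len(attn_scores[0][0])
    cam = [0.0] * num_patches
    for layer_scores, layer_grads in zip(attn_scores, grads):
        for head_scores, head_grads in zip(layer_scores, layer_grads):
            for n in range(num_patches):
                guide = sigmoid(head_scores[n])          # F_h^k
                contribution = max(head_grads[n], 0.0)   # ReLU(alpha_h^{k,c})
                cam[n] += guide * contribution
    return cam


# Toy input: K = 1 layer, H = 2 heads, N = 4 patches (made-up numbers).
attn_scores = [[[0.0, 2.0, -2.0, 0.0], [0.0, 0.0, 0.0, 0.0]]]
grads = [[[1.0, 1.0, 1.0, -1.0], [2.0, -3.0, 0.0, 0.0]]]
cam = attention_guided_cam(attn_scores, grads)
```

In this toy case the last patch receives a zero score because its only nonzero gradient is negative and is removed by the ReLU, while the first patch accumulates positive contributions from both heads. In a real ViT the resulting per-patch list would be reshaped to the patch grid and upsampled to the image size.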
Abstract
Vision Transformer (ViT) is one of the most widely used models in computer vision, with strong performance on a variety of tasks. To fully utilize ViT-based architectures in various applications, proper visualization methods with decent localization performance are necessary, but the methods employed for CNN-based models are still not available for ViT due to its unique structure. In this work, we propose an attention-guided visualization method for ViT that provides a high-level semantic explanation for its decision. Our method selectively aggregates the gradients directly propagated from the classification output to each self-attention, collecting the contribution of image features extracted from each location of the input image. These gradients are additionally guided by the normalized self-attention scores, which are the pairwise patch correlation scores. The scores supplement the gradients with the patch-level context information efficiently detected by the self-attention mechanism. This approach provides detailed high-level semantic explanations with strong localization performance using only class labels. As a result, our method outperforms the previous leading explainability methods for ViT in the weakly-supervised localization task and shows great capability in capturing the full instances of the target class object. Our method also provides a visualization that faithfully explains the model, as demonstrated in the perturbation comparison test.
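Weakly-supervised localization is typically scored by comparing a thresholded heatmap against a ground-truth object mask. The sketch below shows one way to compute pixel accuracy, IoU, and Dice for that comparison. The min-max normalization, the 0.5 threshold, and the helper names `binarize` and `localization_metrics` are assumptions made for illustration and are not taken from the paper.

```python
def binarize(heatmap, threshold=0.5):
    """Min-max normalize a flat heatmap and threshold it (assumed protocol)."""
    lo, hi = min(heatmap), max(heatmap)
    span = hi - lo
    if span == 0:
        return [0] * len(heatmap)
    return [1 if (v - lo) / span >= threshold else 0 for v in heatmap]


def localization_metrics(pred, target):
    """Pixel accuracy, IoU, and Dice for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    pixel_acc = (tp + tn) / len(pred)
    # By convention here, two empty masks count as a perfect match.
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    return pixel_acc, iou, dice


# Toy example with six pixels (made-up values).
heatmap = [0.0, 0.2, 0.9, 1.0, 0.6, 0.1]
target = [0, 1, 1, 1, 0, 0]
pred = binarize(heatmap)
pixel_acc, iou, dice = localization_metrics(pred, target)
```

Dice weights the overlap twice, so it is never lower than IoU for the same pair of masks; reporting both alongside pixel accuracy gives a fuller picture when objects cover only a small part of the image.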
