Dynamic Accumulated Attention Map for Interpreting Evolution of Decision-Making in Vision Transformer

Yi Liao, Yongsheng Gao, Weichuan Zhang

TL;DR

This work tackles the challenge of interpreting Vision Transformers by revealing the evolution of attention across intermediate blocks. It introduces DAAM, which leverages a decomposition module to extract spatial information from the [class] token and dimension-wise importance weights to compute per-block explanations, then accumulates these maps to visualize a dynamic attention flow from input to output. DAAM is validated on both supervised and self-supervised ViTs, showing superior localization and qualitative interpretability compared to CAM-based methods and existing explainability approaches. The approach enables finer analysis of internal model decisions and offers a tool for architecture design and debugging, with future work aimed at reducing reliance on the [class] token and extending to CNN-based models.

Abstract

Various Vision Transformer (ViT) models have been widely used for image recognition tasks. However, existing visual explanation methods cannot display the attention flow hidden inside the inner structure of ViT models, which explains how the final attention regions are formed inside a ViT for its decision-making. In this paper, a novel visual explanation approach, Dynamic Accumulated Attention Map (DAAM), is proposed to provide a tool that can visualize, for the first time, the attention flow from the top to the bottom through ViT networks. To this end, a novel decomposition module is proposed to construct and store the spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block. For supervised ViT models, the module also obtains the channel importance coefficients by decomposing the classification score. Because self-supervised ViT models lack a classification score, we propose dimension-wise importance weights to compute their channel importance coefficients. The spatial features are linearly combined with the corresponding channel importance coefficients to form the attention map for each block. The dynamic attention flow is revealed by accumulating these attention maps block by block. The contribution of this work is to visualize the evolution dynamics of the decision-making attention at any intermediate block inside a ViT model by means of the novel decomposition module and the dimension-wise importance weights. Quantitative and qualitative analyses consistently validate the effectiveness and superior capacity of the proposed DAAM for interpreting not only ViT models with fully-connected layers as the classifier but also self-supervised ViT models. The code is available at https://github.com/ly9802/DynamicAccumulatedAttentionMap.
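
The two steps named in the abstract, a linear combination of spatial features with channel importance coefficients followed by block-wise accumulation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy values, and the choice to reuse one set of coefficients for every block are assumptions made for illustration only. How the spatial features are extracted from the [class] token and how the coefficients are derived are not covered here.

```python
from typing import List

# Hypothetical layout: one row per image patch, one column per channel.
Matrix = List[List[float]]


def block_attention_map(features: Matrix, weights: List[float]) -> List[float]:
    """Linearly combine channel-wise spatial features into one value per patch."""
    return [sum(w * f for w, f in zip(weights, patch)) for patch in features]


def accumulate_maps(per_block_features: List[Matrix],
                    weights: List[float]) -> List[List[float]]:
    """Return the running sum of per-block maps, one accumulated map per block.

    Assumption: the same channel importance coefficients are applied to
    every block; the source does not spell this out.
    """
    accumulated: List[List[float]] = []
    running = [0.0] * len(per_block_features[0])
    for features in per_block_features:
        block_map = block_attention_map(features, weights)
        running = [r + m for r, m in zip(running, block_map)]
        accumulated.append(list(running))  # copy so earlier maps stay fixed
    return accumulated


# Toy input: 2 blocks, 3 patches, 2 channels (all values are made up).
toy_features = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [2.0, 0.0], [0.0, 0.0]],
]
toy_weights = [2.0, -1.0]

flow = accumulate_maps(toy_features, toy_weights)
print(flow)
```

Each entry of `flow` is the accumulated map up to that block, so viewing the entries in order gives a toy version of the attention flow described above.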

Paper Structure

This paper contains 19 sections, 15 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: The pipeline of the proposed DAAM. The operation @ denotes matrix multiplication. The dashed line represents an optional pipeline for self-supervised ViT models. An image is first fed into a ViT model to generate the decision-making [class] token. Then, the [class] token is fed into the classifier or the memory bank (the dashed line) to calculate the classification score or the inner product between the similarity score and the proposed dimension-wise importance weight. The importance coefficients are obtained by decomposing the classification score or the inner product. The semantic feature maps extracted by the proposed decomposition module are multiplied element-wise by the importance coefficients to form the attention map for each intermediate block. The proposed DAAM is generated by accumulating the attention maps from the first block to the final block.
  • Figure 2: The attention flows generated by the proposed DAAM for five pretrained ViT models: DeiT-small-patch16 [DeiT], DeiT-tiny-patch16 [DeiT], DINO-ViT-small-patch8 [DINO], ViT-small [ViT], and ViT-base [ViT].
  • Figure 3: The attention flows generated by the proposed DAAM for DeiT-small-patch16 [DeiT], DINO-ViT-small-patch8 [DINO], DeiT-tiny-patch16 [DeiT], ViT-small [ViT], and ViT-base [ViT] on the same image.
  • Figure 4: The attention flows generated by the proposed DAAM for two ViT models: T2T-ViT-14 [T2TViT] and T2T-ViT-24 [T2TViT].
  • Figure 5: The attention flows generated by the proposed DAAM for two pretrained ViT models: CaiT-S24 [CaiT] (2 blocks) and XCiT-m24 [XCiT] (2 blocks).
  • ...and 6 more figures