Inspecting Explainability of Transformer Models with Additional Statistical Information

Hoang C. Nguyen, Haeil Lee, Junmo Kim

TL;DR

Problem: existing attribution methods give limited explanations for Vision Transformers, notably the Swin Transformer. Approach: adapt Chefer et al.'s gradient-attention framework to Swin by addressing its varying patch counts, and account for per-token statistics in the Layer Normalization layers to improve heatmap quality, using layer-wise relation matrices $R^{(i)}$ and a fusion rule $R = f(R) \cdot R^{(i)}$. Findings: on the ImageNet validation set and imagenetseg, the proposed method localizes objects more accurately and produces less noisy heatmaps, outperforming attention-only baselines and remaining competitive with Transformer Attribution on ViT. Significance: the work provides a practical, architecture-aware explainability approach for Swin and ViT and highlights the role of per-token statistics in interpretation.

Abstract

Transformers have become increasingly popular in the vision domain in recent years, so there is a need for an effective way to interpret Transformer models through visualization. In recent work, Chefer et al. effectively visualize Transformers on vision and multi-modal tasks by combining attention layers to show the importance of each image patch. However, when applied to other Transformer variants such as the Swin Transformer, this method cannot focus on the predicted object. Our method, by considering the statistics of tokens in layer normalization layers, shows a strong ability to explain both Swin Transformer and ViT.
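To make "combining attention layers" concrete, the following is a minimal sketch of a gradient-weighted attention aggregation in the spirit of Chefer et al. as it is commonly described: each layer's attention map is weighted by its gradient, clipped at zero, averaged over heads, and accumulated into a relevance matrix that starts as the identity. The function name `aggregate_relevance` and the input layout are assumptions made for illustration. The sketch also assumes a fixed number of tokens in every layer, so it does not cover the varying patch counts of Swin or the token-statistics handling proposed in this paper.

```python
import numpy as np

def aggregate_relevance(attentions, gradients):
    """Accumulate a token-to-token relevance matrix across layers.

    attentions, gradients: lists (one entry per layer) of arrays with
    shape (heads, tokens, tokens). This layout is an assumption of the
    sketch, not something specified by the paper.
    """
    num_tokens = attentions[0].shape[-1]
    # Start from the identity: each token is initially relevant only to itself.
    relevance = np.eye(num_tokens)
    for attn, grad in zip(attentions, gradients):
        # Gradient-weighted attention, keeping positive contributions only,
        # then averaged over the head dimension.
        weighted = np.clip(grad * attn, 0.0, None).mean(axis=0)
        # Residual-style update: R <- R + A_bar @ R.
        relevance = relevance + weighted @ relevance
    return relevance

# Hypothetical usage with one layer, one head, and two tokens.
attn = np.full((1, 2, 2), 0.5)
grad = np.ones((1, 2, 2))
print(aggregate_relevance([attn], [grad]))
```

Because the update multiplies square matrices of one fixed size, any architecture whose token count changes between layers needs an extra step to reconcile the shapes, which is the gap the Swin adaptation described above addresses.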

Paper Structure (7 sections, 5 equations, 1 figure, 1 table)

Figures (1)

  • Figure 1: Sample visualization results. For the Swin Transformer liu2021swin, the Transformer Attribution method places all of the attention in a corner of the image, whereas our method with 2 layers (0 and 1) finds a reasonable object location for different targets. For the ViT dosovitskiy2021image visualization, our method reduces the noise in the output heat map.