LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

Walid Bousselham, Angie Boggust, Sofian Chaybouti, Hendrik Strobelt, Hilde Kuehne

TL;DR

LeGrad introduces a simple, scalable explainability method for Vision Transformers by computing gradients with respect to attention maps and aggregating these layerwise signals to produce a final heatmap. It accommodates various ViT architectures and feature aggregation schemes, including attentional pooling, and is validated across segmentation, open-vocabulary localization, audio localization, and perturbation benchmarks. The results show LeGrad achieves state-of-the-art spatial fidelity and robustness, including on large models and multimodal setups, while remaining efficient enough for practical use. The work provides a practical, plug-and-play tool for transparent ViTs, with open-source code and broad applicability across vision and vision-language tasks.

Abstract

Vision Transformers (ViTs), with their ability to model long-range dependencies through self-attention mechanisms, have become a standard architecture in computer vision. However, the interpretability of these models remains a challenge. To address this, we propose LeGrad, an explainability method specifically designed for ViTs. LeGrad computes the gradient with respect to the attention maps of ViT layers, considering the gradient itself as the explainability signal. We aggregate the signal over all layers, combining the activations of the last as well as intermediate tokens to produce the merged explainability map. This makes LeGrad a conceptually simple and easy-to-implement tool for enhancing the transparency of ViTs. We evaluate LeGrad in challenging segmentation, perturbation, and open-vocabulary settings, showcasing its versatility compared to other SotA explainability methods and demonstrating its superior spatial fidelity and robustness to perturbations. A demo and the code are available at https://github.com/WalBouss/LeGrad.
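
The aggregation step described in the abstract can be illustrated with a small, framework-free sketch. It assumes the gradients with respect to each layer's attention map have already been obtained from an autograd framework and are supplied as nested lists indexed [head][query][key]. The function names `layer_map` and `merge_layers`, the clipping of negative gradients, the averaging over heads, query tokens, and layers, and the convention that token 0 is the class token are all assumptions made for this illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]   # indexed [query][key]
LayerGrads = List[Matrix]    # indexed [head][query][key]


def layer_map(grads: LayerGrads) -> List[float]:
    """Collapse one layer's attention-map gradients into one score per key token.

    Assumption: negative gradients are clipped to zero, then the result is
    averaged over heads and query tokens.
    """
    n_heads = len(grads)
    n_queries = len(grads[0])
    n_keys = len(grads[0][0])
    scores = [0.0] * n_keys
    for head in grads:
        for row in head:
            for k, g in enumerate(row):
                scores[k] += max(g, 0.0)
    return [s / (n_heads * n_queries) for s in scores]


def merge_layers(all_grads: List[LayerGrads], grid_h: int, grid_w: int) -> Matrix:
    """Average the per-layer maps, drop the class token, min-max normalise,
    and reshape the patch scores into a grid_h x grid_w heatmap."""
    per_layer = [layer_map(g) for g in all_grads]
    n_keys = len(per_layer[0])
    merged = [sum(m[k] for m in per_layer) / len(per_layer) for k in range(n_keys)]
    patches = merged[1:]  # assumption: token 0 is the class token
    if len(patches) != grid_h * grid_w:
        raise ValueError("patch count does not match grid size")
    lo, hi = min(patches), max(patches)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat map
    norm = [(p - lo) / span for p in patches]
    return [norm[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy input: 2 layers, 1 head, 5 tokens (class token + 2x2 patches).
# Every query row is identical here purely to keep the example short.
layer1 = [[[0.0, 0.1, -0.2, 0.3, 0.0]] * 5]
layer2 = [[[0.0, 0.1, 0.8, 0.1, 0.2]] * 5]
heat = merge_layers([layer1, layer2], grid_h=2, grid_w=2)
```

In this toy input the second patch receives a negative gradient in the first layer, which is clipped to zero, and a large positive gradient in the second layer, so it ends up as the maximum of the normalised heatmap. With a real model the gradient arrays would come from backpropagating a score (for example an image-text similarity) to each layer's attention map.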

Paper Structure (30 sections, 13 equations, 18 figures, 10 tables)

Figures (18)

  • Figure 1: LeGrad explainability maps: For a given VLM and an input textual prompt, LeGrad generates a heatmap indicating the part of the image that is most sensitive to that prompt. Examples shown for OpenCLIP ViT-B/16 (150M params.) and ViT-bigG/14 (2B params.).
  • Figure 2: Overview of LeGrad: Given a text prompt or a classifier $\mathcal{C}$, an activation $s^l$ is computed for each layer $l$. The activation $s^l$ is then used to compute the explainability map of that layer. The layerwise explainability maps are then merged to produce LeGrad's output.
  • Figure 3: LeGrad for a single layer.
  • Figure 4: Ablation on the number of layers used in LeGrad for different architecture sizes. (Top): AUC for negative perturbation on ImageNet-val for different layers used for LeGrad. (Bottom): point-mIoU on OpenImagesV7 for different layers used for LeGrad.
  • Figure 5: Qualitative analysis of the impact of each layer for different model sizes using "a photo of a cat" as prompt. In smaller models, the explainability signal predominantly emanates from the final layers, while in larger models, lower layers also contribute.
  • ...and 13 more figures