LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity
Walid Bousselham, Angie Boggust, Sofian Chaybouti, Hendrik Strobelt, Hilde Kuehne
TL;DR
LeGrad is a simple, scalable explainability method for Vision Transformers: it computes gradients with respect to the attention maps and aggregates these layerwise signals into a final heatmap. It accommodates various ViT architectures and feature aggregation schemes, including attentional pooling, and is validated across segmentation, open-vocabulary localization, audio localization, and perturbation benchmarks. The results show that LeGrad achieves state-of-the-art spatial fidelity and robustness, including on large models and multimodal setups, while remaining efficient enough for practical use. The work provides a plug-and-play tool for making ViTs more transparent, with open-source code and broad applicability across vision and vision-language tasks.
Abstract
Vision Transformers (ViTs), with their ability to model long-range dependencies through self-attention mechanisms, have become a standard architecture in computer vision. However, the interpretability of these models remains a challenge. To address this, we propose LeGrad, an explainability method specifically designed for ViTs. LeGrad computes the gradient with respect to the attention maps of ViT layers, considering the gradient itself as the explainability signal. We aggregate the signal over all layers, combining the activations of the last as well as intermediate tokens to produce the merged explainability map. This makes LeGrad a conceptually simple and easy-to-implement tool for enhancing the transparency of ViTs. We evaluate LeGrad in challenging segmentation, perturbation, and open-vocabulary settings, showcasing its versatility compared to other state-of-the-art (SotA) explainability methods and demonstrating its superior spatial fidelity and robustness to perturbations. A demo and the code are available at https://github.com/WalBouss/LeGrad.
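To make the idea of "gradient with respect to the attention map, aggregated over layers" concrete, the following is a minimal toy sketch in plain Python, not the released LeGrad implementation. It assumes a single attention head per layer, no MLP or residual path, mean pooling of the output tokens, and a linear score. The clipping of negative gradients, the averaging over query tokens, the averaging over layers, and the min-max normalisation are illustrative assumptions, since the abstract does not spell them out. All function names are hypothetical. In a real ViT the gradient would come from automatic differentiation rather than the closed-form expression used here.

```python
# Toy sketch only: every name below is hypothetical and not part of the LeGrad code base.

def score(A, V, w):
    """Scalar activation of a toy layer: Z = A @ V, mean-pool tokens, project onto w."""
    n, d = len(A), len(w)
    Z = [[sum(A[i][j] * V[j][k] for j in range(n)) for k in range(d)] for i in range(n)]
    pooled = [sum(Z[i][k] for i in range(n)) / n for k in range(d)]
    return sum(pooled[k] * w[k] for k in range(d))

def grad_wrt_attention(A, V, w):
    """Closed-form d(score)/dA[i][j] for the toy layer: (V[j] . w) / n."""
    n = len(A)
    per_key = [sum(vk * wk for vk, wk in zip(V[j], w)) / n for j in range(n)]
    return [[per_key[j] for j in range(n)] for _ in range(n)]

def layer_heatmap(A, V, w):
    """Assumed post-processing: clip negative gradients, average over query tokens."""
    G = grad_wrt_attention(A, V, w)
    n = len(G)
    return [sum(max(G[i][j], 0.0) for i in range(n)) / n for j in range(n)]

def legrad_like_map(layers, w):
    """Assumed aggregation: average per-layer maps, then min-max normalise to [0, 1]."""
    maps = [layer_heatmap(A, V, w) for A, V in layers]
    n = len(maps[0])
    merged = [sum(m[j] for m in maps) / len(maps) for j in range(n)]
    lo, hi = min(merged), max(merged)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in merged]

# Made-up example: 3 tokens, 2 feature dimensions, 2 layers with uniform attention.
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
layers = [
    (uniform, [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]),  # (attention map, values) of layer 1
    (uniform, [[2.0, 0.0], [0.0, 0.0], [0.0, 1.0]]),   # (attention map, values) of layer 2
]
w = [1.0, 1.0]  # stands in for a class or text embedding
heat = legrad_like_map(layers, w)
print(heat)  # one relevance value per token
```

In this toy setup, token 0 receives the highest relevance because its value vector aligns positively with `w` in both layers, whereas the other two tokens each contribute in only one layer.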
