AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers

Reduan Achtibat, Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Aakriti Jain, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek

TL;DR

AttnLRP introduces an attention-aware Layer-Wise Relevance Propagation framework for transformers, deriving faithful rules to propagate relevance through nonlinear attention, MLPs, and normalization with a single backward pass. By leveraging Taylor-based decomposition and specialized rules (ε- and γ-LRP, along with a uniform rule for bilinear matmul and an identity rule for element-wise nonlinearities), it achieves high faithfulness and enables interaction with latent neurons. Experimental results across LLMs (e.g., LLaMa 2, Mixtral 8x7b, Flan-T5) and Vision Transformers demonstrate superior attribution quality compared to baselines, with detailed analyses on latent-feature interpretation and neuron manipulation. The work provides open-source tooling and opens pathways for concept-based explanations and safer, more transparent transformer-based systems in practical settings.
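The rules named above can be illustrated with a minimal NumPy sketch. It is not the authors' library code: the function names, the stabilizer convention (sign(0) treated as +1), and the equal split between the two operands of a matrix product are assumptions made for illustration, based on the standard ε-LRP formulation and on one reading of the "uniform rule".

```python
import numpy as np

def stabilize(z, eps=1e-6):
    """Shift z away from zero by eps while keeping its sign (sign(0) := +1)."""
    return z + eps * np.where(z >= 0, 1.0, -1.0)

def lrp_epsilon_linear(x, W, b, relevance_out, eps=1e-6):
    """epsilon-LRP for z = W @ x + b.

    R_i = sum_j  W[j, i] * x[i] / (z_j + eps * sign(z_j)) * R_j
    """
    z = W @ x + b
    s = relevance_out / stabilize(z, eps)  # relevance per unit of output
    return x * (W.T @ s)

def lrp_identity(relevance_out):
    """Identity rule (sketch): an element-wise nonlinearity passes relevance through unchanged."""
    return relevance_out

def lrp_uniform_matmul(A, V, relevance_out, eps=1e-6):
    """Uniform rule (sketch) for the bilinear product O = A @ V.

    Each product term A[j, i] * V[i, p] receives its share of R[j, p],
    and that share is split equally between the two operands (assumption).
    """
    O = A @ V
    s = relevance_out / stabilize(2.0 * O, eps)
    R_A = A * (s @ V.T)   # shape of A
    R_V = V * (A.T @ s)   # shape of V
    return R_A, R_V

# Toy usage: with zero bias, relevance is (approximately) conserved.
x = np.array([1.0, 2.0])
W = np.array([[1.0, -1.0], [0.5, 2.0]])
R_in = lrp_epsilon_linear(x, W, np.zeros(2), np.array([0.3, 0.7]))
print(R_in, R_in.sum())
```

Because every rule is a closed-form redistribution of the incoming relevance, chaining them layer by layer costs about as much as one backward pass, which is the efficiency property the summary refers to.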

Abstract

Large Language Models are prone to biased predictions and hallucinations, underlining the paramount importance of understanding their model-internal reasoning process. However, achieving faithful attributions for the entirety of a black-box transformer model while maintaining computational efficiency is an unsolved challenge. By extending the Layer-wise Relevance Propagation attribution method to handle attention layers, we address these challenges effectively. While partial solutions exist, our method is the first to faithfully and holistically attribute not only input but also latent representations of transformer models with computational efficiency similar to that of a single backward pass. Through extensive evaluations against existing methods on LLaMa 2, Mixtral 8x7b, Flan-T5 and vision transformer architectures, we demonstrate that our proposed approach surpasses alternative methods in terms of faithfulness and enables the understanding of latent representations, opening the door for concept-based explanations. We provide an LRP library at https://github.com/rachtibat/LRP-eXplains-Transformers.

Paper Structure (53 sections, 88 equations, 18 figures, 9 tables)

This paper contains 53 sections, 88 equations, 18 figures, 9 tables.

Figures (18)

  • Figure 1: By optimizing LRP for transformer-based architectures, our LRP variant outperforms other state-of-the-art methods in terms of explanation faithfulness and computational efficiency. We are further able to explain latent neurons inside and outside the attention module, allowing us to interact with the model. A more detailed discussion of the differences between AttnLRP and other LRP variants can be found in Appendix \ref{app:difference}. Heatmaps for other methods are illustrated in Appendix Figure \ref{fig:vit_heatmaps}. Legend: highly ($+$), semi- ($\circ$), not ($-$) suited. Credit: Nataba/iStock.
  • Figure 2: AttnLRP combined with activation maximization (ActMax) allows one to identify relevant neurons and gain insights into their encodings. This in turn allows one to manipulate the latent representations and, e.g., to change the output "Arctic" (by disabling the corresponding neuron) to "Desert" or "Candy Store" (by activating the respective neurons). See also Section \ref{experiments:understanding}.
  • Figure 3: There are two approaches for understanding knowledge neurons: (a) Neuron 3948 at the last non-linearity in layer 17 of the Phi-1.5 model selects a weight row to add to the residual stream. Projected onto the vocabulary, this weight row spans topics about ice, cold places and winter sport. (b) Sentences that maximally activate this neuron contain references to coldness. Attributing the neuron with AttnLRP highlights the most relevant tokens inside the input sentences. Inspired by [voita2023neurons].
  • Figure 1.4: Comparison of four different LRP variants computed on a LLaMa 2-7b model. The given section is from the Wikipedia article on Mount Everest. The model is expected to provide the next answer token for the question 'How high did they climb in 1922? According to the text, the 1922 expedition reached 8,'. The attribution is computed for the correctly predicted token '3'. Distributing the bias uniformly over the input variables (Softmax Distribute Bias) or applying the identity rule (Softmax Identity Rule) leads to numerical instabilities. For "Softmax Distribute Bias" and "Softmax Identity Rule", we applied AttnLRP rules on all layers except for the softmax function. AttnLRP highlights the correct token most strongly, while CP-LRP focuses strongly on the start-of-sequence <s> token and exhibits more background noise: irrelevant tokens such as 'Context', 'attracts' and 'Everest' are highlighted, whereas AttnLRP does not highlight them or assigns them negative relevance.
  • Figure 2.5: Comparison of AttnLRP (ours) with the $\gamma$-rule, Grad$\times$AttnRoll [chefer2021generic], AtMan [deb2023atman], and SmoothGrad [smilkov2017smoothgrad] techniques through the perturbation experiment (faithfulness) on ViT-B-16 using 3200 random samples of ImageNet. From left to right, the plots correspond to $f_{j}(\mathcal{X}^{F}_{LeRF})-f_{j}(\mathcal{X}^{F}_{MoRF})$ (large area is good), $f_{j}(\mathcal{X}^{F}_{MoRF})$ (steep decline is good), and $f_{j}(\mathcal{X}^{F}_{LeRF})$ (slow decline is good). "AUC" denotes the Area under Curve.
  • ...and 13 more figures
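
The perturbation experiment described in the ViT-B-16 caption above can be sketched on a toy model. This is a hedged illustration only: the function names, the zero baseline used to "remove" a feature, and the linear toy model are assumptions, not the paper's evaluation code, which perturbs real model inputs.

```python
import numpy as np

def perturbation_curve(f, x, relevance, most_relevant_first, baseline=0.0):
    """Model output while features are replaced one by one with a baseline.

    most_relevant_first=True  gives the MoRF curve (steep decline is good),
    most_relevant_first=False gives the LeRF curve (slow decline is good).
    """
    order = np.argsort(relevance)          # ascending relevance
    if most_relevant_first:
        order = order[::-1]                # descending relevance
    x_pert = np.array(x, dtype=float)      # work on a copy
    outputs = [f(x_pert)]
    for idx in order:
        x_pert[idx] = baseline             # "remove" the feature (assumption: zero baseline)
        outputs.append(f(x_pert))
    return np.array(outputs)

def area_under_curve(values):
    """Trapezoidal area with the perturbation axis normalized to [0, 1]."""
    values = np.asarray(values, dtype=float)
    steps = len(values) - 1
    return float(np.sum((values[1:] + values[:-1]) / 2.0) / steps)

# Toy usage: a linear model, for which weight * input is an exact attribution.
w = np.array([4.0, 3.0, 2.0, 1.0])
x = np.ones(4)
f = lambda v: float(w @ v)
relevance = w * x

morf = perturbation_curve(f, x, relevance, most_relevant_first=True)
lerf = perturbation_curve(f, x, relevance, most_relevant_first=False)
score = area_under_curve(lerf) - area_under_curve(morf)  # larger is better
print(morf, lerf, score)
```

A faithful attribution removes the most influential features first, so the MoRF curve drops quickly, the LeRF curve drops slowly, and the area between them grows.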