A Close Look at Decomposition-based XAI-Methods for Transformer Language Models

Leila Arras, Bruno Puri, Patrick Kahardipraja, Sebastian Lapuschkin, Wojciech Samek

TL;DR

The paper evaluates decomposition-based XAI methods for Transformer language models by introducing a ground-truth subject–verb agreement benchmark and comparing ALTI-Logit, LRP, AttnLRP, and gradient-based explanations. It extends ALTI-Logit to Llama models and presents a fast Gradient×Input–based implementation of AttnLRP (LRPx), enabling scalable analysis. Using a 28k-sample ground-truth (GT) dataset evaluated on BERT, GPT-2, and Llama-3, the study finds model-dependent strengths across metrics such as MRR, RMA, PTA, and PGk: gradient-based methods excel on PTA, while the performance of decomposition-based methods varies by model. The authors release the benchmark data and code, and report substantial computational speedups for LRPx, which makes the approach practical for evaluating XAI methods on language models.

Abstract

Various XAI attribution methods have recently been proposed for the transformer architecture, allowing for insights into the decision-making process of large language models by assigning importance scores to input tokens and intermediate representations. One class of methods that seems very promising in this direction is decomposition-based approaches, i.e., XAI-methods that redistribute the model's prediction logit through the network, as this value is directly related to the prediction. We note, however, that two prominent methods of this category, namely ALTI-Logit and LRP, have not yet been analyzed side by side in the literature. We propose to close this gap by conducting a careful quantitative evaluation w.r.t. ground truth annotations on a subject-verb agreement task, as well as various qualitative inspections, using BERT, GPT-2 and Llama-3 as a testbed. Along the way we compare and extend the ALTI-Logit and LRP methods, including the recently proposed AttnLRP variant, from an algorithmic and implementation perspective. We further incorporate in our benchmark two widely-used gradient-based attribution techniques. Finally, we make our carefully constructed benchmark dataset for evaluating attributions on language models, as well as our code, publicly available in order to foster evaluation of XAI-methods on a well-defined common ground.

Paper Structure

This paper contains 32 sections, 30 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Our XAI evaluation pipeline using subject-verb agreement: 1) Predict the logits difference for the two verb forms, 2) Explain the logits difference by generating a token-level relevance heatmap for each XAI-method (for decomposition-based XAI-methods the relevances sum up to the logits difference), 3) Evaluate the heatmaps w.r.t. ground truth linguistic evidence (i.e., the verb's subject) by computing various relevance accuracy metrics (e.g., the fraction of positive relevance falling inside the GT).
  • Figure 2: Exemplary heatmaps on correctly predicted samples for the Llama-3.2-1B model. The predicted verb is highlighted in green, positive relevance is mapped to red, negative to blue. The ground truth subject is underlined (in all considered samples it is the token preceding the verb).
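
The three steps in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' released code: the function names, the toy sentence, and all numbers are hypothetical, and the convention of returning 0 when there is no positive relevance is an assumption made only for this sketch. The last function computes the example metric named in the caption, the fraction of positive relevance falling inside the GT.

```python
from typing import Sequence, Set


def logit_difference(logit_correct: float, logit_wrong: float) -> float:
    """Step 1: the quantity to explain (correct minus incorrect verb form)."""
    return logit_correct - logit_wrong


def conservation_gap(relevances: Sequence[float], logit_diff: float) -> float:
    """Step 2 sanity check: for a decomposition-based method the token
    relevances should sum to the logit difference, so the gap should be ~0."""
    return abs(sum(relevances) - logit_diff)


def positive_relevance_fraction(relevances: Sequence[float],
                                gt_positions: Set[int]) -> float:
    """Step 3: share of the total positive relevance that lands on GT tokens."""
    positive = [max(r, 0.0) for r in relevances]
    total = sum(positive)
    if total == 0.0:
        # Assumed convention for this sketch: no positive evidence scores 0.
        return 0.0
    inside = sum(p for i, p in enumerate(positive) if i in gt_positions)
    return inside / total


# Toy example with made-up numbers (not taken from the paper).
tokens = ["The", "old", "keys"]   # context preceding the verb to predict
relevances = [0.5, -0.2, 1.5]     # hypothetical token-level heatmap
gt_positions = {2}                # index of the subject "keys"

diff = logit_difference(3.0, 1.2)  # hypothetical logits for the two verb forms
print("conservation gap:", conservation_gap(relevances, diff))
print("positive relevance inside GT:",
      positive_relevance_fraction(relevances, gt_positions))
```

In this toy case the relevances sum to the logit difference up to floating-point error, and about three quarters of the positive relevance falls on the subject token. Clipping negative relevance before normalizing keeps the score in [0, 1], so heatmaps from different methods can be compared on the same scale.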