Attention Mechanisms Don't Learn Additive Models: Rethinking Feature Importance for Transformers

Tobias Leemann; Alina Fastowski; Felix Pfeiffer; Gjergji Kasneci

TL;DR

Transformers challenge conventional feature attribution because they cannot be faithfully described by additive surrogates. The authors formalize this incompatibility and introduce SLALOM, the Softmax-Linked Additive Log Odds Model, which assigns each token a value $v(\tau)$ and an importance $s(\tau)$ and combines them in a softmax-weighted log-odds formulation $F(t)=\sum_{\tau_i \in t} \alpha_i(t) v(\tau_i)$ with $\alpha_i(t)=\frac{\exp(s(\tau_i))}{\sum_{\tau_j \in t} \exp(s(\tau_j))}$ and a normalizing constraint $\sum_{\tau} s(\tau)=\gamma$. They prove that transformers can implement SLALOM, show that its parameters are identifiable from only $2|\mathcal{V}|-1$ forward passes, and develop two post-hoc fitting procedures: SLALOM-eff, which targets efficiency, and SLALOM-fidel, which targets fidelity. Empirical results on synthetic linear data and real sentiment datasets (IMDB, Yelp-HAT) across BERT, DistilBERT, GPT-2, BLOOM, and Mamba indicate that SLALOM yields explanations with substantially higher fidelity and better alignment with human attention than conventional surrogates, while scaling to large models and even enabling black-box application to GPT-4. Overall, SLALOM provides a principled, efficient, and high-fidelity framework for explaining transformer decisions beyond additive attributions.
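To make the formula above concrete, the following is a minimal sketch of evaluating $F(t)$ for a token sequence. The function name `slalom_score`, the dictionaries `value` and `importance`, and all numbers are hypothetical and chosen for illustration only; this is not the authors' implementation (their code is in the linked repository).

```python
import math

def slalom_score(tokens, value, importance):
    """Evaluate F(t) = sum_i alpha_i(t) * v(tau_i) for a non-empty token list.

    alpha_i(t) is the softmax of the importance scores s(tau_i) taken over
    the tokens that actually occur in the sequence t.
    """
    scores = [importance[tok] for tok in tokens]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w / total * value[tok] for w, tok in zip(weights, tokens))

# Hypothetical toy parameters (assumption, not from the paper).
value = {"great": 2.0, "awful": -3.0, "the": 0.0}
# Importance scores chosen to sum to gamma = 0 over the vocabulary.
importance = {"great": 1.0, "awful": 1.5, "the": -2.5}

single = slalom_score(["great"], value, importance)
pair = slalom_score(["the", "great"], value, importance)
print(single, pair)
```

For a single-token sequence the softmax weight is 1, so the output equals that token's value. In the two-token case the low-importance token "the" pulls the output only slightly toward its own value.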

Abstract

We address the critical challenge of applying feature attribution methods to the transformer architecture, which dominates current applications in natural language processing and beyond. Traditional attribution methods to explainable AI (XAI) explicitly or implicitly rely on linear or additive surrogate models to quantify the impact of input features on a model's output. In this work, we formally prove an alarming incompatibility: transformers are structurally incapable of representing linear or additive surrogate models used for feature attribution, undermining the grounding of these conventional explanation methodologies. To address this discrepancy, we introduce the Softmax-Linked Additive Log Odds Model (SLALOM), a novel surrogate model specifically designed to align with the transformer framework. SLALOM demonstrates the capacity to deliver a range of insightful explanations with both synthetic and real-world datasets. We highlight SLALOM's unique efficiency-quality curve by showing that SLALOM can produce explanations with substantially higher fidelity than competing surrogate models or provide explanations of comparable quality at a fraction of their computational costs. We release code for SLALOM as an open-source project online at https://github.com/tleemann/slalom_explanations.

Paper Structure (53 sections, 10 theorems, 54 equations, 14 figures, 14 tables, 2 algorithms)

Introduction
Related Work
Preliminaries
Input and output representations
The common transformer architecture
Encoder-only and decoder-only models
Analysis
Transformers cannot represent additive models
Transformer networks with multiple layers cannot represent additive models
A Surrogate Model for Transformers
The Softmax-Linked Additive Log Odds Model
Theoretical properties of SLALOM
Numerical algorithms for computing SLALOMs
Relating SLALOM scores to linear attributions
Experimental Evaluation
...and 38 more sections

Key Result

Proposition 4.1

Let $\mathcal{V}$ be a vocabulary and $C \ge 2, C \in \mathbb{N}$ be a maximum sequence length (context length). Let $w_i: \mathcal{V} \rightarrow \mathbb{R}, \forall i \in 1,...,C$ be any map that assigns a token encountered at position $i$ a numerical score including at least one token $\tau \in

Figures (14)

Figure 1: Transformers cannot be well explained through additive models. Left: As an example, we show the log odds for the outputs of a BERT model and a linear Naïve-Bayes model ("linear") that assigns each word a weight, both trained on the IMDB movie review dataset. The token colors indicate the weights assigned by the linear model. We pass two sequences to the models independently and in concatenation. For the linear model, the output for the concatenated sequence is the sum of the outputs for the two sequences, but this is not the case for BERT. We show that this phenomenon is not due to a non-linearity in this particular model but stems from a general incapacity of transformers to represent additive functions. Right: To overcome this difficulty, we propose SLALOM, a novel surrogate model specifically designed to better approximate transformer models.
Figure 2: Transformer architecture. In each layer $l{=}1,\ldots,L$, input embeddings ${\bm{h}}_i^{(l-1)}$ for each token $i$ are transformed into output embeddings ${\bm{h}}_i^{(l)}$. When detaching the part prior to the classification head ("cls"), we see that the output only depends on the last embedding ${\bm{h}}_1^{(L-1)}$ and attention output ${\bm{s}}_1$.
Figure 3: Transformers fail to learn linear models. We train different models on a synthetically sampled dataset where the log odds obey a linear relation to the features. Fully connected models (2-layer ReLU networks with different hidden layer widths) capture the linear form of the relationship well despite some estimation error (a). However, common transformer models fail to model this relationship and output almost constant values (b)-(d). This does not change with more layers.
Figure 4: Verifying properties with synthetic data: SLALOM describes outputs of transformer models well (a, b). We fit SLALOM to the outputs of the BERT and GPT-2 models trained on the linear synthetic dataset. The linear and GAM models (despite having $C/2{=}15\times$ more parameters) do not match the transformer's behavior. We provide another empirical counterexample and additional quantitative results in \ref{sec:describingtransformers}. Verifying recovery (c, d). We verify the recovery property on a second synthetic dataset where features and labels obey a SLALOM relation. We train a 2-layer DistilBERT model on the data and fit SLALOM to the trained model. We can recover the original logit scores (c) and see a strong connection between original SLALOM parameters and the recovered ones (d). These findings verify the learnability and recovery properties. More results in \ref{sec:recovery_ext}.
Figure 5: Explaining a real review with SLALOM (qualitative results). SLALOM assigns two scores to each token (a, b) and can be used to compute attributions via its linearization (c). We observe that the impactful words have high importances and the value scores indicate the sign of their contribution (positive or negative words). See \ref{fig:imdbfullscatter} (Appendix) for fully annotated plots.
...and 9 more figures
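
The contrast described in the caption of Figure 1 can be reproduced on a toy scale. The sketch below uses hypothetical weights and a hand-written SLALOM function rather than the BERT or Naïve-Bayes models from the figure. It only illustrates two properties that follow from the formulas: an additive score of a concatenation is the sum of the parts, whereas a softmax-weighted score is a convex combination of token values and therefore stays bounded however often a token is repeated.

```python
import math

def additive_score(tokens, weight):
    """Additive surrogate: log odds are the sum of per-token weights."""
    return sum(weight[tok] for tok in tokens)

def slalom_score(tokens, value, importance):
    """SLALOM: softmax(importance)-weighted average of token values."""
    scores = [importance[tok] for tok in tokens]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w / total * value[tok] for w, tok in zip(weights, tokens))

# Hypothetical toy parameters (assumption, not taken from the paper).
weight = {"good": 1.0, "bad": -1.5, "film": 0.0}
importance = {"good": 0.0, "bad": 0.0, "film": 0.0}  # equal importance

a, b = ["good", "film"], ["bad", "film"]

# Additive model: the concatenation scores as the sum of the two parts.
additive_concat = additive_score(a + b, weight)
additive_sum = additive_score(a, weight) + additive_score(b, weight)

# Repeating a token grows the additive score but leaves SLALOM unchanged.
additive_five = additive_score(["good"] * 5, weight)
slalom_one = slalom_score(["good"], weight, importance)
slalom_five = slalom_score(["good"] * 5, weight, importance)

print(additive_concat, additive_sum, additive_five, slalom_one, slalom_five)
```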

Theorems & Definitions (15)

Proposition 4.1: Single-layer transformers cannot represent additive models.
Corollary 4.2
Corollary 4.3: Multi-Layer transformers cannot learn additive models either
Proposition 5.1: Transformers can fit SLALOM
Proposition 5.2: Recovery of SLALOMs
Proposition B.1: Proposition 4.1 in the main paper
proof
Corollary B.2: Transformers cannot represent linear models
proof
Corollary B.3: Corollary 4.3 in the main paper
...and 5 more
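
One way to see where a count of $2|\mathcal{V}|-1$ forward passes can come from is sketched below. It is a reconstruction under stated assumptions, not necessarily the procedure used in the authors' proof of Proposition 5.2: the black box is assumed to follow SLALOM exactly, and every token's value is assumed to differ from that of a chosen reference token. Single-token queries return $v(\tau)$ directly, and two-token queries against the reference reveal the importance differences. All names (`make_black_box`, `recover_slalom`) and numbers are hypothetical.

```python
import math

def make_black_box(value, importance):
    """Return a function that behaves exactly like a SLALOM with these parameters."""
    def f(tokens):
        weights = [math.exp(importance[tok]) for tok in tokens]
        total = sum(weights)
        return sum(w / total * value[tok] for w, tok in zip(weights, tokens))
    return f

def recover_slalom(f, vocab, gamma=0.0):
    """Recover (v, s) from 2|V| - 1 queries, assuming v(tok) != v(ref) for tok != ref."""
    ref = vocab[0]
    queries = 0
    v = {}
    for tok in vocab:  # |V| single-token queries: F([tok]) = v(tok)
        v[tok] = f([tok])
        queries += 1
    diff = {ref: 0.0}  # diff[tok] = s(tok) - s(ref)
    for tok in vocab[1:]:  # |V| - 1 two-token queries
        out = f([tok, ref])
        queries += 1
        # out = alpha * v(tok) + (1 - alpha) * v(ref), alpha = sigmoid(s(tok) - s(ref))
        alpha = (out - v[ref]) / (v[tok] - v[ref])
        diff[tok] = math.log(alpha / (1.0 - alpha))
    # Shift all scores so that they sum to gamma, as the constraint requires.
    offset = (gamma - sum(diff.values())) / len(vocab)
    s = {tok: diff[tok] + offset for tok in vocab}
    return v, s, queries

# Hypothetical ground-truth parameters (assumption); importances sum to gamma = 0.
true_v = {"great": 2.0, "awful": -3.0, "the": 0.5}
true_s = {"great": 1.0, "awful": 1.5, "the": -2.5}

f = make_black_box(true_v, true_s)
rec_v, rec_s, n_queries = recover_slalom(f, list(true_v), gamma=0.0)
print(n_queries, rec_v, rec_s)
```

With a three-token vocabulary the sketch issues five queries, matching $2|\mathcal{V}|-1$. A real transformer only approximates a SLALOM, so this exact inversion would be sensitive to model error. That sensitivity is one reason a fitted procedure such as SLALOM-eff or SLALOM-fidel is the practical choice.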
