Attention Meets Post-hoc Interpretability: A Mathematical Perspective

Gianluigi Lopardo, Frederic Precioso, Damien Garreau

TL;DR

The paper asks whether attention weights in Transformer-like architectures provide faithful explanations, or whether post-hoc explainers offer better insight. It analyzes a simple single-layer, multi-head attention classifier for binary sentiment classification and derives explicit forms for $\nabla_{e_t} f(x)$ and $\beta^\infty_j$ that relate to the attention weights $\alpha_t^{(i)}$ and to the linear layers $W_\ell^{(i)}, W_v^{(i)}, W_k^{(i)}$. The main finding is that gradient-based and LIME explanations encode substantial information from the forward path beyond what attention weights reveal, with LIME producing an affine-like transformation of attention under certain conditions. The results clarify the limitations of attention as an explanation, emphasize the value of post-hoc methods, and outline future work to extend the analysis to deeper architectures and other domains.

Abstract

Attention-based architectures, in particular transformers, are at the heart of a technological revolution. Interestingly, in addition to helping obtain state-of-the-art results on a wide range of applications, the attention mechanism intrinsically provides meaningful insights on the internal behavior of the model. Can these insights be used as explanations? Debate rages on. In this paper, we mathematically study a simple attention-based architecture and pinpoint the differences between post-hoc and attention-based explanations. We show that they provide quite different results, and that, despite their limitations, post-hoc methods are capable of capturing more useful insights than merely examining the attention weights.

Paper Structure

This paper contains 43 sections, 8 theorems, 70 equations, and 6 figures.

Key Result

Theorem 4.1

The gradient of the model $f$ defined by Eq. \ref{eq:def-model}, with respect to the embedded token $e_t$, $t \in [T]$, is

Figures (6)

  • Figure 1: Different explainers can produce very different explanations. Here, the attention mean ($\alpha$-avg) and maximum ($\alpha$-max) over the heads, LIME (lime), the gradient mean (G-avg), $L^1$ norm (G-l1), and $L^2$ norm (G-l2) with respect to the tokens, and Gradient times Input (G$\times$I) are employed to interpret the prediction of a sentiment-analysis model. Words with positive (respectively, negative) weights are highlighted in green (respectively, red), with intensity proportional to their weight. In the example, all the explainers identify the word "questionable" as highly significant, while only lime and G$\times$I highlight a negative contribution. Interestingly, $\alpha$-avg and $\alpha$-max identify the word "popular" as the most important word in absolute terms, in disagreement with all the others.
  • Figure 2: Illustration of the architecture of the model defined in Section \ref{sec:the-model}. The input text, denoted as $x \in [D]^T$, is transformed into an embedding $e \in \mathbb{R}^{T \times d_e}$ by summing word embeddings and positional encodings as in Eq. \ref{eq:def-embedding}. For each of the $K$ heads, the key $k \in\mathbb{R}^{T \times d_\text{att}}$, query $q \in\mathbb{R}^{T \times d_\text{att}}$, and value $v \in\mathbb{R}^{T \times d_\text{out}}$ matrices are computed by applying linear transformations to $e$ using $W_k,W_q \in\mathbb{R}^{d_\text{att}\times d_e}$, and $W_v\in\mathbb{R}^{d_\text{out}\times d_e}$, respectively. The attention weights $\alpha \in \mathbb{R}^T$ are then computed as the softmax of the scaled dot-product of $k$ and $q$, as per Eq. \ref{eq:def-attention}. Then the intermediary output $\tilde{v} \in \mathbb{R}^{d_\text{out}}$ is computed as the average of the values $v$ weighted by the attention $\alpha$. Each head outputs the linear transformation $W_{\ell} \in \mathbb{R}^{1\times d_\text{out}}$ of the $\tilde{v}$ associated with the query corresponding to the [CLS] token. The final prediction $f(x)$ of the model is the average of the outputs across all heads.
  • Figure 3: Attention matrices across the heads. Each head is represented by a distinct matrix, demonstrating the unique focus each head has on different parts of the document. The matrices illustrate that tokens within the document can carry significantly different weights, indicating the varying importance or relevance of each token in the context of the document. The aggregation of these weights to provide token-level scores is a critical aspect. Note that Eqs. \ref{eq:attn-avg} and \ref{eq:attn-max} correspond to the average and the maximum values, respectively, of the first row across all six matrices.
  • Figure 4: Illustration of the accuracy of Eq. \ref{eq:lime-meets-attention}. The boxplots show the results from $5$ runs of LIME with default parameters, while the red crosses indicate the predictions given by Theorem \ref{th:lime-meets-attention}. The document $\xi$ contains $T=99$ tokens and $d=71$ distinct words and is classified as a negative review. Note that Theorem \ref{th:lime-meets-attention} holds true even with $T\neq d$ for reasonable word multiplicities, as discussed in Section \ref{sec:limitation}.
  • Figure 5: Illustration of the accuracy of Theorem \ref{th:gradient-meets-attention}. Here, for illustrative purposes, $d_e=80$.
  • ...and 1 more figures
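
The forward pass described in the Figure 2 caption, together with the head-wise aggregations mentioned in the Figure 1 and Figure 3 captions, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation. It assumes random weights, an already-computed embedding `e`, a [CLS] token at position 0, and a scaling factor of $1/\sqrt{d_\text{att}}$; the function names, the dictionary layout of the heads, and the toy dimensions are all hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()

def forward(e, heads, cls_index=0):
    """Single-layer multi-head attention classifier (sketch).

    e:     (T, d_e) embedded tokens (word embedding + positional encoding).
    heads: list of K dicts with W_k, W_q of shape (d_att, d_e),
           W_v of shape (d_out, d_e), and W_l of shape (1, d_out).
    Returns the prediction f(x) and the (K, T) attention weights
    for the [CLS] query (assumed to sit at position cls_index).
    """
    outputs, alphas = [], []
    for h in heads:
        k = e @ h["W_k"].T                  # keys,    (T, d_att)
        q = e @ h["W_q"].T                  # queries, (T, d_att)
        v = e @ h["W_v"].T                  # values,  (T, d_out)
        d_att = k.shape[1]
        # Attention of the [CLS] query over all T tokens (assumed scaling).
        alpha = softmax(k @ q[cls_index] / np.sqrt(d_att))   # (T,)
        v_tilde = alpha @ v                 # attention-weighted average, (d_out,)
        outputs.append((h["W_l"] @ v_tilde).item())
        alphas.append(alpha)
    return float(np.mean(outputs)), np.stack(alphas)

# Toy dimensions (hypothetical); K = 6 heads as in the Figure 3 caption.
rng = np.random.default_rng(0)
T, d_e, d_att, d_out, K = 5, 8, 4, 3, 6
e = rng.normal(size=(T, d_e))
heads = [
    {
        "W_k": rng.normal(size=(d_att, d_e)),
        "W_q": rng.normal(size=(d_att, d_e)),
        "W_v": rng.normal(size=(d_out, d_e)),
        "W_l": rng.normal(size=(1, d_out)),
    }
    for _ in range(K)
]

f_x, alphas = forward(e, heads)
alpha_avg = alphas.mean(axis=0)   # token-level score: mean over heads
alpha_max = alphas.max(axis=0)    # token-level score: max over heads
```

Note that `alpha_avg` and `alpha_max` depend only on the key and query projections, whereas `f_x` also passes through `W_v` and `W_l`. This is one way to see why explanations built from the full forward path can differ from the attention weights alone.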

Theorems & Definitions (13)

  • Theorem 4.1: Gradient meets attention
  • Theorem 5.1: LIME meets attention
  • Proposition 2.1: Approximated conditional expectation
  • Lemma 4.1: Expected ratio
  • proof
  • Lemma 4.2: Integral approximation
  • proof
  • Lemma 4.3: Conditional variance computation
  • proof
  • Lemma 4.4: Exact expressions
  • ...and 3 more