Is Attention Interpretable?

Sofia Serrano, Noah A. Smith

TL;DR

The paper "Is Attention Interpretable?" challenges the common assumption that attention weights faithfully explain NLP model decisions. The authors develop an erasure-based framework that compares attention-derived input importance against the actual impact on predictions across multiple datasets and architectures, measuring impact with JS divergence and decision flips. They find that attention is a noisy predictor of input importance: it sometimes correlates with impact but often fails to identify minimal, decisive input sets, and its interpretability depends on the contextualization scope. These results suggest that attention should not be used as the sole explanation mechanism and motivate gradient- or product-based ranking approaches for explanations.

Abstract

Attention mechanisms have recently boosted performance on a range of NLP tasks. Because attention layers explicitly weight input components' representations, it is also often assumed that attention can be used to identify information that models found important (e.g., specific contextualized word tokens). We test whether that assumption holds by manipulating attention weights in already-trained text classification models and analyzing the resulting differences in their predictions. While we observe some ways in which higher attention weights correlate with greater impact on model predictions, we also find many ways in which this does not hold, i.e., where gradient-based rankings of attention weights better predict their effects than their magnitudes. We conclude that while attention noisily predicts input components' overall importance to a model, it is by no means a fail-safe indicator.
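
The erasure test described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy model (an attention-weighted sum of representations followed by a hypothetical linear output layer `out_weights`), the function names, and all numbers are assumptions made for illustration. The sketch zeroes one attention weight, renormalizes the remaining weights (also an assumption of this sketch), and measures the Jensen-Shannon divergence between the original and the modified output distributions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def erase_and_renormalize(attn, idx):
    """Zero the attention weight at idx and rescale the rest to sum to 1."""
    kept = [0.0 if i == idx else a for i, a in enumerate(attn)]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("cannot renormalize: no attention mass left")
    return [a / total for a in kept]

def predict(attn, reps, out_weights):
    """Toy model (hypothetical): attention-weighted sum, then linear + softmax."""
    dim = len(reps[0])
    context = [sum(a * r[d] for a, r in zip(attn, reps)) for d in range(dim)]
    logits = [sum(w * c for w, c in zip(row, context)) for row in out_weights]
    return softmax(logits)

# Illustrative values only: three input items, 2-d representations, four classes.
attn = [0.6, 0.3, 0.1]
reps = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out_weights = [[2.0, -1.0], [-1.0, 2.0], [0.5, 0.5], [0.0, 0.0]]

original = predict(attn, reps, out_weights)
for i, weight in enumerate(attn):
    modified = predict(erase_and_renormalize(attn, i), reps, out_weights)
    print(f"item {i}: attention={weight:.2f}  JS={js_divergence(original, modified):.4f}")
```

Comparing the printed JS values with the attention weights shows, for this toy setup, whether erasing the highest-weighted item actually changes the output distribution the most.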

Paper Structure

This paper contains 21 sections, 3 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Our method for calculating the importance of representations corresponding to zeroed-out attention weights, in a hypothetical setting with four output classes.
  • Figure 2: Flat attention network (FLAN) demonstrating a convolutional encoder. Each contextualized word representation is the concatenation of two sizes of convolutions: one applied over the input representation and its two neighbors to either side, and the other applied over the input representation and its single neighbor to either side. For details, see Appendix A.1.
  • Figure 3: Difference in attention weight magnitudes versus $\Delta\mathrm{JS}$ for HANrnns, comparable to results for the other architectures; for their plots, see Appendix A.2.
  • Figure 4: These are the counts of test instances for the HANrnn models for which $i^\ast$'s JS divergence was smaller, binned by $\Delta\alpha$. These counts comprise a small fraction of the test set sizes listed in Table \ref{dataset-stats}.
  • Figure 5: The distribution of fractions of items removed before first decision flips on three model architectures under different ranking schemes. Boxplot whiskers represent the highest/lowest data point within 1.5 IQR of the higher/lower quartile, and dataset names at the bottom apply to their whole column. In several of the plots, the median or lower quartile aren't visible; in these cases, the median/lower quartile is either 1 or very close to 1.
  • ...and 8 more figures
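
The quantity plotted in Figure 5, the fraction of items removed before the first decision flip under a given ranking scheme, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the toy model, the names `predict_label` and `fraction_removed_before_flip`, the renormalization step, and the convention of returning 1.0 when the decision never flips are all hypothetical choices made for this example.

```python
def predict_label(attn, reps, out_weights):
    """Toy model (hypothetical): attention-weighted sum, linear layer, argmax.

    Softmax is monotonic, so the argmax over logits equals the argmax over
    class probabilities and the softmax step is skipped here.
    """
    dim = len(reps[0])
    context = [sum(a * r[d] for a, r in zip(attn, reps)) for d in range(dim)]
    logits = [sum(w * c for w, c in zip(row, context)) for row in out_weights]
    return max(range(len(logits)), key=lambda k: logits[k])

def fraction_removed_before_flip(attn, reps, out_weights, ranking):
    """Erase items in `ranking` order; return the fraction removed at the first flip.

    Returns 1.0 if the predicted label never changes (assumed convention).
    """
    original = predict_label(attn, reps, out_weights)
    kept = list(attn)
    n = len(attn)
    for removed, idx in enumerate(ranking, start=1):
        kept[idx] = 0.0
        total = sum(kept)
        if total == 0.0:
            break  # nothing left to attend to, so no flip was observed
        renormalized = [a / total for a in kept]
        if predict_label(renormalized, reps, out_weights) != original:
            return removed / n
    return 1.0

# Illustrative values only: three items, two classes.
attn = [0.6, 0.2, 0.2]
reps = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
out_weights = [[1.0, 0.0], [0.0, 1.0]]

# Two example ranking schemes: by descending attention, and an arbitrary order.
by_attention = sorted(range(len(attn)), key=lambda i: -attn[i])
arbitrary = [1, 2, 0]

frac_attention = fraction_removed_before_flip(attn, reps, out_weights, by_attention)
frac_arbitrary = fraction_removed_before_flip(attn, reps, out_weights, arbitrary)
print(frac_attention, frac_arbitrary)
```

A ranking scheme that flips the decision after removing a smaller fraction of items identifies a more compact decisive set; a gradient-based ranking could be passed through the same `ranking` argument.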