Is Sparse Attention more Interpretable?

Clara Meister, Stefan Lazov, Isabelle Augenstein, Ryan Cotterell

TL;DR

The paper questions the common claim that sparse attention improves interpretability by examining whether sparsity yields faithful explanations when attention operates on internal representations rather than inputs. It introduces an entropy-based dispersion measure for input influence and evaluates LSTM and Transformer models on three text-classification tasks, observing weak links between inputs and co-indexed representations and no robust mapping from sparse attention to a small set of influential inputs. The results show that increasing sparsity tends to reduce the correlation between attention and input feature importance and does not produce sparse input explanations, suggesting sparsity may actually hinder interpretability. Overall, the findings argue against assuming sparse attention enhances interpretability and emphasize the need for concrete evidence before adopting sparsity-based explanations in NLP models.

Abstract

Sparse attention has been claimed to increase model interpretability under the assumption that it highlights influential inputs. Yet the attention distribution is typically over representations internal to the model rather than the inputs themselves, suggesting this assumption may not have merit. We build on the recent work exploring the interpretability of attention; we design a set of experiments to help us understand how sparsity affects our ability to use attention as an explainability tool. On three text classification tasks, we verify that only a weak relationship between inputs and co-indexed intermediate representations exists -- under sparse attention and otherwise. Further, we do not find any plausible mappings from sparse attention distributions to a sparse set of influential inputs through other avenues. Rather, we observe in this setting that inducing sparsity may make it less plausible that attention can be used as a tool for understanding model behavior.

Paper Structure

This paper contains 20 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Correlation between the attention distribution and gradient-based feature importance (FI) measures. Across all models, attention correlates notably more strongly with the FI of intermediate representations than with the FI of inputs.
  • Figure 2: Entropy of the gradient-based $\mathbf{g}_{\hat{y}}(\mathbf{x})$ and LOO $D_{\hat{y}}(\mathbf{x})$ FI distributions. Results are from models with the full spectrum of projection functions.
  • Figure 3: Correlation between the attention distribution and input FI measures as a function of the sparsity penalty $\lambda$ used in the projection function $\phi_{\mathrm{sparseg}}$. The $x$-axis is log-scaled for $\lambda < 0$ since $\lambda \in (-\infty, 1)$. Results are from the IMDb dataset.
  • Figure 4: Correlation between the attention distribution and Leave-One-Out (LOO) FI measures. Across all models, attention correlates more strongly with intermediate-representation FI than with input FI.
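
The entropy reported in Figure 2 can be read as a dispersion score: low entropy means a few inputs carry most of the importance, high entropy means importance is spread across many inputs. The following is a minimal sketch of that idea, not the authors' implementation. It assumes raw FI scores are turned into a distribution by normalizing their absolute values, which this summary does not specify. The function names and the example scores are hypothetical.

```python
import math


def normalize(scores):
    """Map raw FI scores to a probability distribution.

    Assumption: absolute values are normalized to sum to 1.
    """
    magnitudes = [abs(s) for s in scores]
    total = sum(magnitudes)
    if total == 0:
        raise ValueError("all feature-importance scores are zero")
    return [m / total for m in magnitudes]


def entropy(dist):
    """Shannon entropy in nats; zero-probability entries are skipped."""
    return sum(-p * math.log(p) for p in dist if p > 0)


# Hypothetical FI scores for a five-token input.
concentrated = normalize([0.0, 4.0, 0.0, 0.0, 0.0])  # one influential token
dispersed = normalize([1.0, -1.0, 1.0, -1.0, 1.0])   # influence spread evenly

print(entropy(concentrated))
print(entropy(dispersed))
```

Under this reading, a sparse input explanation would show up as a low-entropy FI distribution. The maximum possible value for an input of length n is log(n), reached when every token is equally important.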