Decomposing Attention To Find Context-Sensitive Neurons

Alex Gibson

TL;DR

The paper investigates how transformer attention encodes high-level contextual properties by decomposing first-layer attention into position-based and content-dependent components. It shows that, under a fixed token distribution, softmax denominators of certain heads are stable, enabling a calibration-text–based contextual circuit that linearly summarizes surrounding text for downstream readout. This circuit allows systematic discovery of context-sensitive neurons from model weights, demonstrated by a Commonwealth vs American English neuron, with strong alignment (median $r \approx 0.95$, $FVU \approx 0.14$) between approximate and true contributions. The approach provides a mechanistic, calibration-text–driven pathway to interpretability in transformers and suggests generalizable methods for deeper layers and varied positional schemes.

Abstract

We study transformer language models, analyzing attention heads whose attention patterns are spread out and whose attention scores depend only weakly on content. We argue that the softmax denominators of these heads are stable when the underlying token distribution is fixed. By sampling softmax denominators from a "calibration text", we can combine the outputs of multiple such stable heads in the first layer of GPT2-Small, approximating their combined output by a linear summary of the surrounding text. This approximation enables a procedure in which, from the weights alone and a single calibration text, we can uncover hundreds of first-layer neurons that respond to high-level contextual properties of the surrounding text, including neurons that did not activate on the calibration text.
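
The stability claim in the abstract can be illustrated with a toy numerical sketch. This is not the paper's code and uses no GPT2-Small weights: the vocabulary size, the per-token content scores, the value vectors, the decay rate of the positional score, and the uniform token distribution are all assumptions chosen only for illustration. The sketch samples many texts from one fixed token distribution, measures how much the softmax denominator varies across them, and then swaps the true denominator for one taken from a single calibration text, which makes the head output a position-weighted linear sum over the tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_ctx, d = 1000, 500, 16  # toy sizes (assumptions, not GPT2-Small's)

# Assumed toy head: the score is a positional term plus a weak content term.
content = rng.normal(0.0, 0.3, size=vocab)   # weak content-dependent score per token
values = rng.normal(size=(vocab, d))         # value vector per token
position = -0.002 * np.arange(n_ctx)[::-1]   # slowly decaying positional score

def unnormalised(tokens):
    """exp(score) for each source position, as seen from the final query position."""
    return np.exp(position + content[tokens])

def sample_text():
    """A text of n_ctx tokens drawn from one fixed (here uniform) token distribution."""
    return rng.integers(0, vocab, size=n_ctx)

# Softmax denominator taken from a single calibration text.
calibration = sample_text()
denom_cal = unnormalised(calibration).sum()

# Spread of the true denominators across many texts.
texts = [sample_text() for _ in range(200)]
denoms = np.array([unnormalised(t).sum() for t in texts])
rel_spread = denoms.std() / denoms.mean()

def true_output(tokens):
    """Head output using the text's own softmax denominator."""
    w = unnormalised(tokens)
    return (w / w.sum()) @ values[tokens]

def approx_output(tokens):
    """Head output with the calibration denominator: linear in the per-position terms."""
    return (unnormalised(tokens) / denom_cal) @ values[tokens]

rel_errors = np.array([
    np.linalg.norm(approx_output(t) - true_output(t)) / np.linalg.norm(true_output(t))
    for t in texts
])

print(f"relative spread of denominators: {rel_spread:.4f}")
print(f"mean relative output error:      {rel_errors.mean():.4f}")
```

In this toy setting the approximate output differs from the true one only by the ratio of the two denominators, so a stable denominator directly implies a small approximation error. Real heads and real text need not behave as cleanly as this sketch.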

Paper Structure

This paper contains 24 sections, 14 equations, 7 figures.

Figures (7)

  • Figure 1: TV distance between true and reconstructed attention patterns across sequence positions for the $6$ attention heads analyzed in this work. Results are shown for a representative text from OpenWebText. The approximation maintains low TV (typically $\sim 0.05$) across all heads shown, with similar performance observed across all tested texts.
  • Figure 2: Three types of positional kernels found across all heads in the first layer of GPT2-Small, at $n=500$, with $x_n$ corresponding to ' the'. Slowly decaying heads spread attention broadly, local heads focus on nearby positions, and uniform heads attend approximately equally across the sequence.
  • Figure 3: $\frac{\text{denom}_{h,i,\text{' the'}}}{c_{h,i}}$ plotted against $i$ for a number of test texts, for $6$ of the slowly decaying heads (see Figure 2), where $c_{h,i}$ is an input-independent normalisation obtained by averaging $\text{denom}_{h,i,\text{' the'}}$ over $1000$ texts from OpenWebText.
  • Figure 4: $\text{denom}_{h,\frac{n_{\text{ctx}}}{2},\text{' the'}}$ across $1000$ different input texts drawn from OpenWebText (texts indexed across the x-axis). Shown for $6$ of the slowly decaying first-layer heads.
  • Figure 5: Visualizations of context-sensitive neurons discovered using the contextual circuit
  • ...and 2 more figures