Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Clement Neo; Shay B. Cohen; Fazl Barez

TL;DR

This study investigates how attention heads and next-token neurons interact in LLMs to predict new words and proposes a methodology to identify next-token neurons, find prompts that highly activate them, and determine the upstream attention heads responsible.

Abstract

Understanding the inner workings of large language models (LLMs) is crucial for advancing their theoretical foundations and real-world applications. While the attention mechanism and multi-layer perceptrons (MLPs) have been studied independently, their interactions remain largely unexplored. This study investigates how attention heads and next-token neurons interact in LLMs to predict new words. We propose a methodology to identify next-token neurons, find prompts that highly activate them, and determine the upstream attention heads responsible. We then generate and evaluate explanations for the activity of these attention heads in an automated manner. Our findings reveal that some attention heads recognize specific contexts relevant to predicting a token and activate a downstream token-predicting neuron accordingly. This mechanism provides a deeper understanding of how attention heads work with MLP neurons to perform next-token prediction. Our approach offers a foundation for further research into the intricate workings of LLMs and their impact on text generation and understanding.

Paper Structure (32 sections, 3 equations, 11 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Attention Interpretability
MLP Interpretability
Automated Interpretability
Background
Relating Attention Heads and Next-Token Neurons
Methodology
Identifying Neurons
High-Activating Prompts
Individual Head Attribution
Explaining Attention Heads
Results and Discussion
Attention Heads May Capture Phrases or Context
Baseline Comparisons
...and 17 more sections

Figures (11)

Figure 1: Our approach for characterizing attention heads. For a given token-predicting neuron, we run multiple prompts through GPT-2 and Pythia to find attention heads that activate the neuron. We find that some attention heads activate the neuron only in specific contexts, and use GPT-4 to automate this discovery.
Figure 2: Illustration of our methodology. (1) Identify a token-predicting neuron, characterized by its output weights. (2) Find a set of prompts that highly activate the neuron. (3) Determine the attention heads responsible for activating the neuron during the forward pass for each prompt. (4) Generate explanations for the activity of the attention heads using GPT-4. (5) Use GPT-4 as a zero-shot classifier on test-set prompts: given the explanation, it predicts whether the attention head would be active for each prompt, and we evaluate the accuracy of this classification. For an example-specific explanation of the illustration, refer to Appendix A.
Figure 3: Distribution of the set of neurons with the highest score for each token for GPT-2 Large. For each token, we find its highest-scoring neuron. We find that these highest-scoring neurons tend to be in the later layers, and we take the last five layers for GPT-2 Large (right of dotted line).
Figure 4: An example of prompt truncation. The activation of the " go" neuron on the truncated prompt is at least 80% of its activation on the original prompt. This pre-processing step automatically shortens the prompt while ensuring that it still activates the neuron significantly.
Figure 5: The distribution of head explanation scores for GPT-2 and Pythia variants. A negative skew indicates that the bulk of the distribution lies to the right of the mean (with a longer left tail), and vice versa.
...and 6 more figures
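
The prompt truncation described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes truncation removes tokens from the start of the prompt (keeping a suffix), and it takes the neuron's activation as a caller-supplied function. The names `truncate_prompt`, `neuron_activation`, `keep_ratio`, and `toy_activation` are hypothetical; only the 80% threshold comes from the caption.

```python
from typing import Callable, List, Sequence


def truncate_prompt(
    tokens: Sequence[str],
    neuron_activation: Callable[[Sequence[str]], float],
    keep_ratio: float = 0.8,  # the 80% threshold from the Figure 4 caption
) -> List[str]:
    """Return the shortest suffix of `tokens` whose activation is at least
    `keep_ratio` times the activation on the full prompt.

    Assumption: the prompt is shortened from the front, since the neuron
    is read at the final position for next-token prediction.
    """
    threshold = keep_ratio * neuron_activation(tokens)
    # Try suffixes from shortest to longest and stop at the first that qualifies.
    for start in range(len(tokens) - 1, -1, -1):
        suffix = list(tokens[start:])
        if neuron_activation(suffix) >= threshold:
            return suffix
    # Fall back to the untruncated prompt if no suffix qualifies.
    return list(tokens)


def toy_activation(tokens: Sequence[str]) -> float:
    """Hypothetical stand-in for a model forward pass: the neuron fires
    strongly when "ready" is in context and weakly when "to" is."""
    score = 0.0
    if "ready" in tokens:
        score += 4.0
    if "to" in tokens:
        score += 1.0
    return score


prompt = ["I", "think", "we", "are", "ready", "to"]
short = truncate_prompt(prompt, toy_activation)
print(short)
```

In practice `neuron_activation` would wrap a forward pass through the model and read the chosen neuron's activation at the last token position; the toy function here only stands in for that call.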
