Successor Heads: Recurring, Interpretable Attention Heads In The Wild

Rhys Gould, Euan Ong, George Ogden, Arthur Conmy

TL;DR

This work identifies and interprets successor heads—attention heads that increment ordinal tokens—in large language models across a wide range of architectures and scales. It reveals a shared numeric subspace encoding token indices and introduces mod-10 features discovered via sparse autoencoders, enabling vector arithmetic that can steer successor behavior. The study demonstrates both a weak universality of these mechanisms and their existence in real-world data, including interpretable polysemantic behaviors such as successorship and acronym encoding. Overall, the paper contributes to mechanistic interpretability by exposing concrete, transferable numeric representations and end-to-end circuit explanations for incrementation in frontier LLMs.

Abstract

In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment 'Monday' into 'Tuesday'. We explain the successor head behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has found interpretable language model components in small toy models. However, results in toy models have not yet led to insights that explain the internals of frontier models and little is currently understood about the internal operations of large language models. In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations that are common to different architectures. They form in LLMs with as few as 31 million parameters, and at least as many as 12 billion parameters, such as GPT-2, Pythia, and Llama-2. We find a set of 'mod-10 features' that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on natural language data, identifying interpretable polysemanticity in a Pythia successor head.

Paper Structure

This paper contains 34 sections, 1 equation, 27 figures, 3 tables.

Figures (27)

  • Figure 1: A successor head with OV matrix $W_{OV}$ takes an ordinal token such as 'Monday' in embedding space and maps it to its successor in unembedding space, e.g. 'Tuesday'. The circuit is the composition of the embedding matrix, the first MLP block, a single attention head, and the unembedding matrix.
  • Figure 2: Successor scores (proportion of tokens where succession occurs) for each model tested. Left: the highest successor score observed across all attention heads in each model. Right: successor scores of the best successor heads in Pythia-1.4B, GPT-2 XL, and Llama-2 7B across different tasks.
  • Figure 3: The activations of $t_i$'s most important feature ($y$-axis) in the SAE-decomposition of $t_j$ ($x$-axis), for $t_i, t_j$ numeric tokens. Values averaged over 100 SAE training runs.
  • Figure 4: The logit value for $t_j$ ($x$-axis) when unembedding the most important feature of $t_i$ ($y$-axis), for $t_i, t_j$ numeric tokens. Values averaged over 100 SAE training runs.
  • Figure 5: Logit distribution $W_U W_{OV}f_i$ for each mod-10 feature $f_i$.
  • ...and 22 more figures
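
The circuit in Figure 1 and the successor score in Figure 2 can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' code: the matrices are synthetic, the first MLP block is replaced by the identity, layer normalization is ignored, and the score is computed only over one ordered token set (the paper's exact evaluation procedure may differ). The function name `successor_score` and all variable names are hypothetical.

```python
import numpy as np

def successor_score(W_E, mlp0, W_OV, W_U):
    """Fraction of ordered tokens whose top-logit output is the next token.

    W_E  : (n_tokens, d_model) embeddings of tokens t_0, t_1, ... in order.
    mlp0 : callable mapping (n_tokens, d_model) -> (n_tokens, d_model).
    W_OV : (d_model, d_model) OV matrix of a single head (row-vector convention).
    W_U  : (d_model, n_tokens) unembedding restricted to the same tokens.
    """
    resid = mlp0(W_E)                    # embedding followed by first MLP block
    logits = resid @ W_OV @ W_U          # (n_tokens, n_tokens) effective OV circuit
    predicted = logits.argmax(axis=1)    # top output token for each input token
    targets = np.arange(1, W_E.shape[0]) # successor index of every token but the last
    return float(np.mean(predicted[:-1] == targets))

# Synthetic setup (assumption): orthonormal token embeddings, tied unembedding.
rng = np.random.default_rng(0)
n_tokens, d_model = 7, 16                # e.g. seven days of the week
Q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
W_E = Q[:n_tokens]                       # rows are orthonormal
W_U = W_E.T

# Hand-built "successor" OV matrix: sends the embedding of t_i to that of t_{i+1}.
W_OV = W_E[:-1].T @ W_E[1:]

identity_mlp = lambda x: x               # stand-in for the first MLP block
score = successor_score(W_E, identity_mlp, W_OV, W_U)
print(score)
```

In this construction the hand-built head increments every token that has a successor, so it scores perfectly by design. A real head would be evaluated by substituting the model's own weights, where the score is generally lower and varies by task.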