Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient

George Wang, Jesse Hoogland, Stan van Wingerden, Zach Furman, Daniel Murfet

TL;DR

This work introduces refined local learning coefficients (rLLCs) to study how the structure of the data distribution shapes the internal organization of a transformer during training, applying them to the attention heads of a two-layer attention-only model. Weight-refined (wrLLC) and data-refined (drLLC) variants quantify, respectively, how heads differentiate by function and how they specialize to particular types of data, and they help identify a novel cross-layer multigram circuit arising from coordination between heads. Empirically, wrLLC curves cluster by head type and correlate with the number of multigrams a head memorizes, while drLLCs reveal data-driven specialization (notably to code) and a multigram-coordination mechanism across layers. The study connects distributional structure, loss landscape geometry, and learning dynamics to emergent computational structures, offering a principled toolkit for developmental interpretability that may guide future analyses of larger models across diverse data. These findings broaden our understanding of how internal structure develops in response to structured data during training, with potential implications for model auditing and architectural design.

Abstract

We introduce refined variants of the Local Learning Coefficient (LLC), a measure of model complexity grounded in singular learning theory, to study the development of internal structure in transformer language models during training. By applying these \textit{refined LLCs} (rLLCs) to individual components of a two-layer attention-only transformer, we gain novel insights into the progressive differentiation and specialization of attention heads. Our methodology reveals how attention heads differentiate into distinct functional roles over the course of training, analyzes the types of data these heads specialize to process, and discovers a previously unidentified multigram circuit. These findings demonstrate that rLLCs provide a principled, quantitative toolkit for \textit{developmental interpretability}, which aims to understand models through their evolution across the learning process. More broadly, this work takes a step towards establishing the correspondence between data distributional structure, geometric properties of the loss landscape, learning dynamics, and emergent computational structures in neural networks.

Paper Structure (98 sections, 14 equations, 23 figures, 1 table)

Figures (23)

  • Figure 1: The weight-refined local learning coefficient (wrLLC) measures the complexity of model components (such as attention heads) over training. At the end of training, heads with lower wrLLC can be described by simple algorithms (e.g., induction heads, bracket-matching), whereas heads with higher wrLLC memorize $n$-grams and skip $n$-grams ("multigrams"). Shown on the left are the wrLLC curves over training for layer $1$ heads, automatically clustered by $K$-means (clusters are indicated by a dominant color, within which individual heads are distinguished by shading). The clusters match the head types (middle-right, classified in \ref{appendix:attn-heads}). Final rLLC correlates with the number of memorized multigrams for each multigram head (far-right).
  • Figure 2: The weight-refined local learning coefficient (wrLLC) reveals how different types of attention heads differentiate during training. The wrLLC curve for each head is shown colored by its functional type (\ref{appendix:attn-heads}). Remarkably, the partition of the heads by type coincides with the clustering of their wrLLC curves, viewed as time series and clustered by Euclidean $K$-means (\ref{appendix:clustering-rLLCs}). This suggests that heads that compute differently also develop differently, as revealed by the wrLLC. Throughout this paper, developmental stages LM1--LM5 are colored in the background according to the classification of ICL1.
  • Figure 3: The data-refined local learning coefficient (drLLC) reveals how attention heads specialize to different types of data. The data-refined LLC for GitHub (middle, github) indicates that on code samples, perturbations to the weights in the multigram heads in layer $1$ have significantly less impact on the loss than perturbations to the induction heads. Informally, the drLLC suggests that the induction heads are differentially more important for predicting code than natural language. This distinction is especially pronounced for .
  • Figure 4: Induction heads and multigram heads develop subspecializations. Compared to , the induction head is more involved in predicting induction patterns (\ref{appendix:induction}) that feature punctuation and special characters. Multigram heads and learn skip $n$-grams involved in bracket-matching ("Dyck patterns", \ref{appendix:dyck}). Blue indicates the token to be predicted. Orange indicates the strength of the attention pattern at the current token. Samples are selected by filtering for tokens where ablating the given head leads to the largest increase in loss.
  • Figure 5: Using a 1-layer (L1) model as the data distribution for data-refined LLCs helps locate skip-trigram-related structure. During stage LM3, the drLLC begins decreasing for layer $0$ multigram heads while increasing for layer $1$ heads (middle). In stage LM5, when layer $1$ drLLCs also start decreasing, the decline is significantly less pronounced than in the layer $0$ multigram heads and the rest of the heads (right).
  • ...and 18 more figures
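
The captions for Figures 1 and 2 describe treating each head's wrLLC curve as a time series and clustering the curves with Euclidean $K$-means. The following is a minimal sketch of that idea only, not the authors' implementation: the function name `kmeans_curves` is hypothetical, the curves are synthetic stand-ins rather than measured wrLLCs, and any preprocessing used in the paper (described in its clustering appendix) is not reproduced here.

```python
import numpy as np


def kmeans_curves(curves, k, n_iter=100, seed=0):
    """Euclidean K-means over rows of `curves`.

    Each row is assumed to be one head's curve sampled at the same
    training checkpoints (hypothetical layout, chosen for illustration).
    Returns (labels, centroids).
    """
    rng = np.random.default_rng(seed)
    curves = np.asarray(curves, dtype=float)
    # Initialise centroids from k distinct curves (fancy indexing copies).
    centroids = curves[rng.choice(len(curves), size=k, replace=False)]
    labels = np.zeros(len(curves), dtype=int)
    for step in range(n_iter):
        # Squared Euclidean distance from every curve to every centroid.
        diffs = curves[:, None, :] - centroids[None, :, :]
        dists = (diffs ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once the assignment no longer changes.
        if step > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = curves[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Synthetic example: four "low-complexity" and four "high-complexity" curves.
t = np.linspace(0.0, 1.0, 20)
curves = np.stack(
    [1.0 * t + 0.05 * i for i in range(4)]
    + [5.0 * t + 0.05 * i for i in range(4)]
)
labels, centroids = kmeans_curves(curves, k=2)
print(labels)
```

On these well-separated synthetic curves the two groups end up in different clusters. Which group receives label 0 depends on the random initialisation, so only the partition is meaningful, not the label values.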