Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models

Michael Lan, Philip Torr, Fazl Barez

TL;DR

The paper investigates mechanistic interpretability in transformer architectures by identifying shared sub-circuits that support similar sequence-continuation tasks (numerals, number words, months) in GPT-2 Small and Llama-2-7B. Using a two-stage methodology of connectivity discovery via iterative pruning and functionality discovery through attention-pattern and output-score analyses, the authors reveal a core sub-circuit comprising sequence-member detection heads and successor-prediction components. This shared circuitry generalizes across tasks and languages, including math-related prompts, suggesting reusable abstractions that underlie sequential reasoning. The findings advance understanding of how semantic concepts might be represented across models and provide a foundation for safer, targeted model editing and robustness improvements in language models.
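The connectivity-discovery stage can be illustrated with a toy sketch of iterative pruning. This is not the authors' implementation: the `score` function, the contribution weights, and the 80% threshold are hypothetical stand-ins for running the model with components ablated and measuring task performance. Only the head labels 1.5, 4.4, and 9.1 come from the paper's figures; the other two heads are made up for illustration.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

Head = Tuple[int, int]  # (layer, head index)


def iterative_prune(
    components: Iterable[Head],
    score: Callable[[FrozenSet[Head]], float],
    threshold: float = 0.8,
) -> Set[Head]:
    """Greedily remove components while the score stays >= threshold * full score."""
    kept = set(components)
    full_score = score(frozenset(kept))
    changed = True
    while changed:  # repeat passes until no further component can be dropped
        changed = False
        for head in sorted(kept):  # sorted() copies, so mutating `kept` is safe
            trial = kept - {head}
            if score(frozenset(trial)) >= threshold * full_score:
                kept = trial
                changed = True
    return kept


# Hypothetical additive contributions; a real run would ablate heads in the
# model and measure performance on the sequence-continuation prompts instead.
TOY_WEIGHTS = {(1, 5): 0.30, (4, 4): 0.40, (9, 1): 0.25, (0, 0): 0.03, (2, 2): 0.02}


def toy_score(kept: FrozenSet[Head]) -> float:
    return sum(TOY_WEIGHTS[h] for h in kept)


circuit = iterative_prune(TOY_WEIGHTS.keys(), toy_score, threshold=0.8)
print(sorted(circuit))
```

In this toy setting the two low-weight heads are pruned and the three remaining heads form the circuit. In a real model, component effects are not additive, which is why the loop re-checks every remaining component after each removal.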

Abstract

While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of Arabic numerals, number words, and months. By applying circuit interpretability analysis, we identify a key sub-circuit in both GPT-2 Small and Llama-2-7B responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Additionally, we show that this sub-circuit has effects on various math-related prompts, such as on intervaled circuits, Spanish number word and months continuation, and natural language word problems. Overall, documenting shared computational structures enables better model behavior predictions, identification of errors, and safer editing procedures. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models.

Paper Structure (33 sections, 14 figures, 14 tables)

This paper contains 33 sections, 14 figures, 14 tables.

Figures (14)

  • Figure 1: The important components of a shared, entangled sub-circuit for the Numerals, Number Words, and Months tasks in GPT-2 Small. The functional roles of the components are labeled below them. Resid_post denotes the residual stream state right before the linear unembedding to logits. Full circuits are shown in the appendix.
  • Figure 2: GPT-2 Small attention patterns for (a) Attention Head 1.5 and (b) Head 4.4. Lighter colors mean higher attention values. For each of these detection patterns, the query is shown in green, and the key is shown in blue. The Months are in sequential order, but the Numerals are not. For attention head 1.5, similar types attend to similar types. But for head 4.4, Months attend to Months, but Numerals do not attend to Numerals. For all plots, we take the mean of dataset samples to calculate the attention scores, but display only one sample on the axes for demonstration purposes.
  • Figure 3: GPT-2 Small attention pattern for Head 7.11. At the last token, head 7.11 attends more strongly to more recent sequence members than to earlier ones. An offset pattern along the diagonals of the heatmap indicates that it also functions as a previous-token head.
  • Figure 4: This GPT-2 Small attention pattern for Attention Head 9.1 shows that for the last token, the component pays strong attention to only the most recent sequence member.
  • Figure 5: The attention pattern of attention head 5.25 in Llama-2-7B resembles the attention pattern of GPT-2 Small's attention head 4.4 in Figure 2(b), indicating they have similar functionality as sequence member detection heads.
  • ...and 9 more figures