Finding Highly Interpretable Prompt-Specific Circuits in Language Models

Gabriel Franco, Lucas M. Tassis, Azalea Rohr, Mark Crovella

TL;DR

Together, the results recast circuits as a meaningful object of study by shifting the unit of analysis from tasks to prompts, enabling scalable circuit descriptions in the presence of prompt-specific mechanisms.

Abstract

Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. Most prior work identifies circuits at the task level by averaging across many prompts, implicitly assuming a single stable mechanism per task. We show that this assumption can obscure a crucial source of structure: circuits are prompt-specific, even within a fixed task. Building on attention causal communication (ACC) (Franco & Crovella, 2025), we introduce ACC++, refinements that extract cleaner, lower-dimensional causal signals inside attention heads from a single forward pass. Like ACC, our approach does not require replacement models (e.g., SAEs) or activation patching; ACC++ further improves circuit precision by reducing attribution noise. Applying ACC++ to indirect object identification (IOI) in GPT-2, Pythia, and Gemma 2, we find there is no single circuit for IOI in any model: different prompt templates induce systematically different mechanisms. Despite this variation, prompts cluster into prompt families with similar circuits, and we propose a representative circuit for each family as a practical unit of analysis. Finally, we develop an automated interpretability pipeline that uses ACC++ signals to surface human-interpretable features and assemble mechanistic explanations of prompt-family behavior. Together, our results recast circuits as a meaningful object of study by shifting the unit of analysis from tasks to prompts, enabling scalable circuit descriptions in the presence of prompt-specific mechanisms.
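
To make the idea of prompt families concrete, here is a minimal toy sketch of grouping prompts by circuit similarity. It treats each prompt's circuit as a set of (edge, singular-vector index) pairs and merges prompts with average-linkage clustering. The Jaccard distance, the stopping threshold, and every prompt and edge name below are illustrative assumptions, not the paper's metric, data, or implementation.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(circuits, threshold):
    """Agglomerative clustering with average linkage.

    circuits: dict mapping a prompt id to a set of
              (edge, singular_vector_index) pairs (hypothetical encoding).
    threshold: merging stops once the closest pair of clusters is
               farther apart than this value (an assumed hyperparameter).
    Returns a sorted list of sorted lists of prompt ids.
    """
    clusters = [[name] for name in circuits]

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between cluster members.
        total = sum(
            jaccard_distance(circuits[p], circuits[q]) for p in c1 for q in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        (i, j), dist = min(
            (
                ((i, j), linkage(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Made-up circuits: two prompts per template share most of their edges.
toy_circuits = {
    "abba_1": {("edge_a", 0), ("edge_b", 1), ("edge_c", 0)},
    "abba_2": {("edge_a", 0), ("edge_b", 1), ("edge_d", 0)},
    "baba_1": {("edge_e", 0), ("edge_f", 2), ("edge_g", 0)},
    "baba_2": {("edge_e", 0), ("edge_f", 2), ("edge_h", 1)},
}

families = average_linkage(toy_circuits, threshold=0.6)
print(families)
```

In this toy setup the two prompts within each template share two of four distinct pairs (distance 0.5), while prompts from different templates share none (distance 1.0), so a threshold of 0.6 yields two families. A representative circuit for a family could then be chosen, for example, as the member with the smallest mean distance to the others; that choice is likewise an assumption here.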

Paper Structure

This paper contains 65 sections, 75 equations, 31 figures, 5 tables.

Figures (31)

  • Figure 1: Average linkage clustering of prompt-level traces exposes distinct circuit families rather than a single universal IOI circuit. The top annotation bar indicates high-level templates (ABBA vs. BABA), while the left bar indicates low-level templates (see Appendix \ref{app:clustering-dataset} for the color code). Circuits are represented as sets of edge--singular-vector pairs.
  • Figure 2: Signal similarity uncovers common and distinct functionality across prompts. Comparison of signal similarity matrices between representative circuits across three different models, organized by column: GPT-2 Small (left), Pythia-160M (middle), and Gemma-2 2B (right). Representatives for GPT-2 Small are from ABBA (x-axis) and BABA (y-axis). Representatives for Pythia-160M are from Template 10 (x-axis) and Template 9 (y-axis). Representatives for Gemma-2 2B are from Template 15 (x-axis) and Template 14 (y-axis). Numbers in parentheses (e.g., (1), (2), ...) indicate the $n$-th occurrence of the token in the prompt.
  • Figure 3: Traces are interpretable and expose algorithmic differences in circuits between ABBA and BABA in GPT-2. AH node labels are (destination token, source token); edge labels are automatically generated. Red: feeds logits; purple: low-level; orange: provides an inhibition signal. Dark green nodes show that, in BABA only, the circuit relies on identifying the "second item in a parallel pair", i.e., "Kelly", as the appropriate output.
  • Figure 4: Condition numbers of $W_Q$ (left) and $W_K^{\top}$ (right) from GPT-2 Small.
  • Figure 5: Condition numbers of $W_Q$ (left) and $W_K^{\top}$ (right) from Pythia-160M.
  • ...and 26 more figures

Theorems & Definitions (1)

  • proof