Automatically Identifying Local and Global Circuits with Linear Computation Graphs

Xuyang Ge, Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, Xipeng Qiu

TL;DR

This work addresses mechanistic interpretability of Transformer models by introducing a pipeline that uses Sparse Autoencoders (SAEs) and Transcoders to render OV and MLP circuits as a strictly linear computation graph, enabling exact causal attribution without resorting to linear approximations. It then pairs this representation with Hierarchical Attribution to automatically isolate task-relevant subgraphs, allowing scalable and faithful circuit discovery. Applying the method to GPT-2 Small, the authors uncover fine-grained circuits for bracket, induction, and indirect object identification tasks, linking SAE features to known head-level mechanisms while also revealing new, intermediate insights. The approach offers a principled, scalable way to dissect internal model behavior with precise attribution, though it remains focused on input-specific analysis and invites future work to extend generality and applicability.

Abstract

Circuit analysis of a given model behavior is a central task in mechanistic interpretability. We introduce a circuit discovery pipeline built on Sparse Autoencoders (SAEs) and a variant called Transcoders. With these two modules inserted into the model, the model's computation graph with respect to OV and MLP circuits becomes strictly linear, so our method does not require a linear approximation to compute the causal effect of each node. This fine-grained graph identifies both end-to-end and local circuits, accounting for either logits or intermediate features. A technique called Hierarchical Attribution lets us apply the pipeline scalably. We analyze three kinds of circuits in GPT-2 Small: bracket, induction, and Indirect Object Identification circuits. Our results reveal new findings underlying existing discoveries.

Paper Structure (35 sections, 1 theorem, 14 equations, 7 figures, 3 tables, 1 algorithm)

Key Result

Theorem 3.1

For any subgraph $G'=G[V/v]$, the node weight of the root node is the sum of the attribution scores of all leaf nodes.
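
As a rough illustration of the kind of statement Theorem 3.1 makes, the sketch below uses a tiny bias-free linear graph. It assumes that the attribution score of a leaf is its activation times the gradient of the root with respect to it, and that detaching a hidden node simply removes its contribution. Both are assumptions made for this example, since Definitions 3.1 and 3.2 are not reproduced in this summary. The graph, its weights, and the helper names `root_value` and `leaf_attributions` are hypothetical.

```python
# Hypothetical bias-free linear graph: leaves -> hidden nodes -> root.
# All activations and weights are made up for illustration.
LEAVES = {"a": 1.5, "b": -2.0, "c": 0.5}
W_HIDDEN = {
    "h1": {"a": 2.0, "b": 1.0, "c": 0.0},
    "h2": {"a": -1.0, "b": 0.5, "c": 4.0},
}
W_ROOT = {"h1": 3.0, "h2": -0.5}


def root_value(leaves, detached=()):
    """Forward pass; detached hidden nodes contribute nothing (assumed semantics)."""
    total = 0.0
    for name, row in W_HIDDEN.items():
        if name in detached:
            continue
        hidden = sum(w * leaves[leaf] for leaf, w in row.items())
        total += W_ROOT[name] * hidden
    return total


def leaf_attributions(leaves, detached=()):
    """Assumed attribution score: leaf activation times d(root)/d(leaf)."""
    scores = {}
    for leaf, activation in leaves.items():
        # In a linear graph the gradient is a constant sum over surviving paths.
        grad = sum(
            W_ROOT[name] * row[leaf]
            for name, row in W_HIDDEN.items()
            if name not in detached
        )
        scores[leaf] = activation * grad
    return scores


if __name__ == "__main__":
    # Compare the root value with the summed leaf attributions,
    # first on the full graph and then with "h2" detached.
    for detached in [(), ("h2",)]:
        root = root_value(LEAVES, detached)
        total = sum(leaf_attributions(LEAVES, detached).values())
        print(detached, root, total)
```

Under these assumptions the two printed numbers agree in both cases, because every path from a leaf to the root is a product of constant weights. This is only a toy check of the summation property, not the paper's proof.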

Figures (7)

  • Figure 1: Overview of our method. For a given input, we (1) run a single forward pass with the MLP computation replaced by Transcoders, (2) isolate a subgraph with Hierarchical Attribution in one backward pass, and (3) interpret the important QK attention involved in the identified circuit.
  • Figure 2: Hierarchical Attribution detaches unrelated nodes immediately after they receive their gradient and stops their backpropagation, whereas standard attribution detaches nodes only after the backward pass is complete (Figure \ref{fig:hierachical-attribution-workflow}). We sweep the number of remaining nodes (sparsity) and compare logit recovery (the faithfulness of the identified subgraph). Experiments are conducted on 20 IOI samples (see Section \ref{sec:ioi_circuits}) across 30 sparsity thresholds. The results in Figure \ref{fig:logit-recovery} show that Hierarchical Attribution consistently outperforms standard attribution.
  • Figure 3: (a) Opening Bracket features contribute positively to In-Bracket features, and Closing Bracket features contribute negatively. (b) Closer " [" tokens activate the In-Bracket feature more strongly. (c) Tokens after a " [" initially attend strongly to it, and this attention weakens as the sentence continues. This explains the trend in Figure \ref{fig:bracket-L1A11421-contribution}.
  • Figure 4: "Web" features (L0M.1270 and L1M.23399) and "Web" Preceding features (L2A.14876 and L2A.17608) jointly produce the QK attention of an induction head. The "M" feature is copied to the last token for next-token prediction.
  • Figure 5: In $s_\text{John}$, the consecutive entity feature (denoted as A in Figure \ref{fig:IOI_s_John}) serves as the key vector that Name Mover Heads attend to, copying the answer entity to the last token's residual stream. This mechanism does not work in $s_\text{Mary}$ because the correct answer is no longer a consecutive entity (i.e., the entity that appears after the token "and"). See Appendix \ref{appendix:ioi} for a detailed interpretation of these two examples.
  • ...and 2 more figures
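
The contrast described in the Figure 2 caption can be sketched on a toy linear graph. The example below is a minimal illustration under stated assumptions, not the paper's algorithm: it scores each node by the absolute value of activation times gradient and prunes with a fixed threshold, and the graph, weights, threshold, and function names (`forward`, `attribute`) are all hypothetical. In hierarchical mode a pruned node is detached as soon as it is visited and passes no gradient upstream. In standard mode every node propagates its gradient, so pruning is effectively decided after the full backward pass.

```python
# Hypothetical linear DAG: node value = sum(weight * upstream value).
EDGES = {
    "root": [("h1", 1.0), ("h2", 1.0)],
    "h1": [("x1", 2.0), ("x2", 0.125)],
    "h2": [("x2", 0.125)],
}
INPUTS = {"x1": 1.0, "x2": 1.0}
# Root-first order: every node appears after all nodes that consume it.
ORDER = ["root", "h1", "h2", "x1", "x2"]


def forward(edges, inputs, order):
    """Compute node values, visiting inputs first and the root last."""
    values = dict(inputs)
    for node in reversed(order):
        if node in edges:
            values[node] = sum(w * values[up] for up, w in edges[node])
    return values


def attribute(edges, values, order, threshold, hierarchical):
    """Return the set of nodes whose |activation * gradient| meets the threshold."""
    grads = {node: 0.0 for node in order}
    grads[order[0]] = 1.0  # gradient of the root with respect to itself
    kept = set()
    for node in order:
        # The node's gradient is complete here because all consumers came first.
        keep = abs(values[node] * grads[node]) >= threshold
        if keep:
            kept.add(node)
        if hierarchical and not keep:
            continue  # detach immediately: no gradient flows further upstream
        for up, w in edges.get(node, []):
            grads[up] += w * grads[node]
    return kept


if __name__ == "__main__":
    values = forward(EDGES, INPUTS, ORDER)
    print(sorted(attribute(EDGES, values, ORDER, 0.2, hierarchical=False)))
    print(sorted(attribute(EDGES, values, ORDER, 0.2, hierarchical=True)))
```

In this toy graph the standard variant keeps `x2` because part of its score flows through `h2`, a node that is itself pruned. The hierarchical variant drops `x2`, since only gradient arriving through surviving nodes counts toward its score.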

Theorems & Definitions (4)

  • Definition 3.1: Detaching a node
  • Definition 3.2: Attribution Score
  • Theorem 3.1
  • proof