Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability

Jatin Nainani, Sankaran Vaidyanathan, AJ Yeung, Kartik Gupta, David Jensen

TL;DR

The paper investigates whether a mechanistically interpretable IOI circuit, identified in GPT-2 small, generalizes across prompt formats. By introducing DoubleIO and TripleIO prompt variants, it shows that the base IOI circuit largely preserves its components and functionality even as the task context changes, and often outperforms the full model on these variants. A mechanism named S2 Hacking explains how, under knockout-based evaluation, the circuit can achieve high accuracy on challenging prompts, though this mechanism is not present in the base IOI setting. Further, the authors demonstrate circuit reuse across variants via path patching, revealing that all base IOI components are repurposed with added input paths, and that name order affects decision points, highlighting nuanced, head-level behavior. Together, these results support a view of circuit generalization as a robust property of LLMs, with important implications for interpretability and for understanding the broader capabilities of large networks.

Abstract

Mechanistic interpretability aims to understand the inner workings of large neural networks by identifying circuits, or minimal subgraphs within the model that implement algorithms responsible for performing specific tasks. These circuits are typically discovered and analyzed using a narrowly defined prompt format. However, given the abilities of large language models (LLMs) to generalize across various prompt formats for the same task, it remains unclear how well these circuits generalize. For instance, it is unclear whether the model's generalization results from reusing the same circuit components, the components behaving differently, or the use of entirely different components. In this paper, we investigate the generality of the indirect object identification (IOI) circuit in GPT-2 small, which is well studied and believed to implement a simple, interpretable algorithm. We evaluate its performance on prompt variants that challenge the assumptions of this algorithm. Our findings reveal that the circuit generalizes surprisingly well, reusing all of its components and mechanisms while adding only extra input edges. Notably, the circuit generalizes even to prompt variants where the original algorithm should fail; we discover a mechanism that explains this, which we term S2 Hacking. Our findings indicate that circuits within LLMs may be more flexible and general than previously recognized, underscoring the importance of studying circuit generalization to better understand the broader capabilities of these models.
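
To make concrete why the original algorithm should fail on such variants, the following is a minimal sketch of the IOI algorithm as it is commonly described for GPT-2 small: identify the names in the prompt, suppress every name that appears more than once, and output whatever remains. The function name `ioi_algorithm`, the example names, and the order of names in the DoubleIO-style list are illustrative assumptions, not the authors' code; the model realizes these steps with attention heads rather than with explicit counting.

```python
from collections import Counter

def ioi_algorithm(names):
    """Return the names that survive duplicate suppression.

    `names` is the sequence of name tokens in the prompt, in order.
    Names seen more than once are suppressed; the rest are returned
    in order of first appearance.
    """
    counts = Counter(names)
    # dict.fromkeys preserves first-appearance order while deduplicating.
    return [name for name in dict.fromkeys(names) if counts[name] == 1]

# Base IOI: only the subject is repeated, so the indirect object survives.
base_ioi = ["Mary", "John", "John"]          # IO, S1, S2
print(ioi_algorithm(base_ioi))               # ['Mary']

# DoubleIO-style prompt (hypothetical name order): both the subject and
# the indirect object are repeated, so no candidate survives.
double_io = ["Mary", "John", "Mary", "John"]
print(ioi_algorithm(double_io))              # []
```

The empty result for the second prompt is the failure the abstract refers to: taken literally, duplicate suppression leaves no name to output once the indirect object is also repeated.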

Paper Structure

This paper contains 22 sections, 2 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Left: Different scenarios for the degree to which a circuit could change as the task format changes. Right: The IOI algorithm (top) and the result of applying that algorithm to the DoubleIO prompt variant (bottom) where the subject and indirect object tokens are both duplicated.
  • Figure 2: Deviation in attention scores from base IOI inputs to DoubleIO (left) and TripleIO (right) inputs for the base IOI circuit and full model. Nonzero values indicate deviation in behavior due to the change in prompt format. For the circuit, most heads show low deviation ($<0.1$), particularly the Name Mover heads which are responsible for returning the output. Significant differences between the circuit and model indicate that the base IOI circuit is less faithful on the prompt variants.
  • Figure 3: S2 Hacking in S-Inhibition head 8.6. Left: Attention pattern at the END position for a DoubleIO prompt. Placing all attention on the S2 token would lead to near-perfect accuracy on the task. Head 8.6 splits attention between IO2 and S2 in the full model, but in the base IOI circuit it focuses primarily on S2. Right: Knockout procedure for evaluating circuits, where paths that are not part of the circuit (marked in blue) are mean-ablated out. For head 8.6, the paths from all input tokens other than S2 are knocked out, leading to S2 Hacking.
  • Figure 4: Left: Confidence ratios for model and base IOI circuit. S2 Hacking can be seen in heads 8.6, 5.5, 5.9, and 3.0, where confidence ratio is close to 1 for the model but greater than 1 for the circuit. Right: Functional faithfulness scores for the S and IO tokens. The output is more likely to be correct if these heads predict S, so high values for the subject token (blue) indicate that the circuit is more confident than the model at predicting the correct answer.
  • Figure 5: Logit difference and normalized faithfulness for DoubleIO (left) and TripleIO (right) after adding paths to the Duplicate and Previous Token heads from different input tokens. For both variants, the faithfulness is closest to 1 (ideal) when including paths from the input tokens corresponding to duplicated names: S2 and IO2 for DoubleIO, and S2, IO2, and IO3 for TripleIO.
  • ...and 8 more figures
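
The captions for Figures 3 and 5 refer to mean-ablation knockout, logit difference, and normalized faithfulness. The sketch below illustrates these quantities under stated assumptions: ablation is applied to whole heads (the paper's knockout operates on individual paths, which is finer-grained), logit difference is taken as the IO logit minus the S logit, and normalized faithfulness is assumed to be the ratio of the circuit's logit difference to the full model's, so that 1 is ideal. All function names and array shapes are hypothetical, and the exact definitions in the paper may differ.

```python
import numpy as np

def mean_ablate(head_outputs, in_circuit, reference_outputs):
    """Keep circuit heads; replace every other head with its reference mean.

    head_outputs:      (batch, n_heads, d) outputs on the evaluated prompts
    in_circuit:        (n_heads,) boolean mask, True for heads in the circuit
    reference_outputs: (n_ref, n_heads, d) outputs on a reference distribution
    """
    mean_out = reference_outputs.mean(axis=0)   # (n_heads, d)
    mask = in_circuit[None, :, None]            # broadcast over batch and d
    return np.where(mask, head_outputs, mean_out[None])

def logit_difference(logits, io_ids, s_ids):
    """Mean of logit(IO) - logit(S) over the batch at the final position."""
    rows = np.arange(logits.shape[0])
    return float((logits[rows, io_ids] - logits[rows, s_ids]).mean())

def normalized_faithfulness(circuit_logits, model_logits, io_ids, s_ids):
    """Assumed definition: circuit logit difference over model logit difference."""
    return (logit_difference(circuit_logits, io_ids, s_ids)
            / logit_difference(model_logits, io_ids, s_ids))
```

Under this assumed ratio, a value above 1 would mean the knocked-out circuit is more confident in the indirect object than the full model is, which is the pattern the S2 Hacking analysis sets out to explain.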