Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability
Jatin Nainani, Sankaran Vaidyanathan, AJ Yeung, Kartik Gupta, David Jensen
TL;DR
The paper investigates whether the mechanistically interpretable indirect object identification (IOI) circuit identified in GPT-2 small generalizes across prompt formats. Using two new prompt variants, DoubleIO and TripleIO, it shows that the base IOI circuit largely preserves its components and functionality as the task context changes, and often outperforms the full model on these variants. A mechanism named S2 Hacking explains how, under knockout-based evaluation, the circuit can achieve high accuracy on these challenging prompts; this mechanism is not present in the base IOI setting. Using path patching, the authors further demonstrate circuit reuse across variants: all base IOI components are reused, with additional input paths, and name order affects decision points, revealing nuanced head-level behavior. Together, these results support a view of circuit generalization as a robust property of LLMs, with implications for interpretability and for understanding the broader capabilities of large networks.
Abstract
Mechanistic interpretability aims to understand the inner workings of large neural networks by identifying circuits, or minimal subgraphs within the model that implement algorithms responsible for performing specific tasks. These circuits are typically discovered and analyzed using a narrowly defined prompt format. However, given the ability of large language models (LLMs) to generalize across various prompt formats for the same task, it remains unclear how well these circuits generalize. For instance, it is unclear whether the model's generalization results from reusing the same circuit components, from those components behaving differently, or from the use of entirely different components. In this paper, we investigate the generality of the indirect object identification (IOI) circuit in GPT-2 small, which is well-studied and believed to implement a simple, interpretable algorithm. We evaluate its performance on prompt variants that challenge the assumptions of this algorithm. Our findings reveal that the circuit generalizes surprisingly well, reusing all of its components and mechanisms while adding only new input edges. Notably, the circuit generalizes even to prompt variants where the original algorithm should fail; we discover a mechanism, which we term S2 Hacking, that explains this. Our findings indicate that circuits within LLMs may be more flexible and general than previously recognized, underscoring the importance of studying circuit generalization to better understand the broader capabilities of these models.
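To make concrete what it means for a prompt variant to "challenge the assumptions" of the algorithm, here is a minimal toy sketch. The abstract does not spell out the algorithm, so the sketch assumes the version commonly described in earlier IOI work: collect the names in the prompt, discard any name that appears more than once, and predict the name that remains. The function `naive_ioi_guess` and both name lists are hypothetical illustrations; in particular, the second list is not the paper's DoubleIO or TripleIO format, only an example of a prompt in which the indirect object is also repeated.

```python
from collections import Counter


def naive_ioi_guess(prompt_names):
    """Toy stand-in for the assumed IOI algorithm (not the paper's code).

    Takes the names in the order they occur in a prompt and returns those
    that occur exactly once, in order of first appearance.
    """
    counts = Counter(prompt_names)
    # dict.fromkeys preserves first-appearance order while dropping repeats
    return [name for name in dict.fromkeys(prompt_names) if counts[name] == 1]


# Base-style prompt: "When Mary and John went to the store, John gave a drink to"
base_names = ["Mary", "John", "John"]

# Hypothetical variant in which the indirect object is also mentioned twice
variant_names = ["Mary", "John", "Mary", "John"]

base_guess = naive_ioi_guess(base_names)        # one candidate survives
variant_guess = naive_ioi_guess(variant_names)  # no candidate survives

print(base_guess)
print(variant_guess)
```

Under this assumed rule, the base prompt leaves exactly one candidate while the variant leaves none, which is the sense in which the original algorithm "should fail" on such prompts even though the circuit is reported to still perform well.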
