Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed, Can Rager, Arthur Conmy
TL;DR
The paper tackles scalable mechanistic interpretability by introducing Edge Attribution Patching (EAP), a fast, edge-centric method for identifying task-relevant subnetworks (circuits) in large transformer-like models. EAP applies a first-order Taylor approximation to activation patching, so the importance of every edge can be estimated from two forward passes and one backward pass; the least important edges are then pruned from the computational graph to recover a circuit. Empirically, EAP often matches or exceeds Activation Patching and ACDC on circuit recovery, as measured by ROC AUC, while requiring far fewer forward passes. The authors also show that running ACDC on the EAP-pruned subgraph can yield further improvements, suggesting a practical two-stage workflow for automated circuit discovery in large models.
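To make the cost argument concrete, here is a minimal sketch on a toy two-edge "model". It is not the authors' implementation. The weights, inputs, and function names are all hypothetical. A hand-written gradient stands in for the backward pass, and ranking edges by absolute score is an assumption of this sketch. It compares the first-order estimate with exact activation patching, which needs one extra forward pass per edge.

```python
import math

# Toy setup (all values are made up for illustration): two upstream nodes
# each send one activation along an edge into a single output node.
W_OUT = [1.5, -0.2]  # weight applied to each incoming edge at the output


def upstream(x):
    """Activations the two upstream nodes send along their edges."""
    return [math.sin(x), 0.5 * x]


def metric(edge_acts):
    """Scalar task metric computed from the edge activations."""
    z = sum(w * a for w, a in zip(W_OUT, edge_acts))
    return math.tanh(z)


def metric_grad(edge_acts):
    """Analytic gradient of the metric w.r.t. each edge activation.

    Stands in for the single backward pass on the clean run.
    """
    z = sum(w * a for w, a in zip(W_OUT, edge_acts))
    dz = 1.0 - math.tanh(z) ** 2
    return [dz * w for w in W_OUT]


def attribution_scores(x_clean, x_corrupt):
    """First-order estimate of the metric change from patching each edge."""
    clean = upstream(x_clean)      # forward pass 1 (clean input)
    corrupt = upstream(x_corrupt)  # forward pass 2 (corrupted input)
    grad = metric_grad(clean)      # one backward pass, shared by all edges
    return [(c - a) * g for a, c, g in zip(clean, corrupt, grad)]


def exact_patching_effects(x_clean, x_corrupt):
    """Exact activation patching: one additional forward pass per edge."""
    clean = upstream(x_clean)
    corrupt = upstream(x_corrupt)
    base = metric(clean)
    effects = []
    for i in range(len(clean)):
        patched = list(clean)
        patched[i] = corrupt[i]  # swap in the corrupted activation on edge i
        effects.append(metric(patched) - base)
    return effects


approx = attribution_scores(0.3, 0.5)
exact = exact_patching_effects(0.3, 0.5)

# Rank edges by absolute estimated effect and keep the top one (prune the rest).
ranking = sorted(range(len(approx)), key=lambda i: abs(approx[i]), reverse=True)
keep = ranking[:1]
```

In this toy case the estimate and the exact effect agree in sign and rank the edges the same way, but they differ in magnitude because the metric is nonlinear. That gap is the price of the linear approximation.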
Abstract
Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that averaged over all tasks our method has greater AUC from circuit recovery than other methods.
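The linear approximation described in the abstract can be written compactly. The notation here is assumed for illustration and does not come from the text above: $L$ is the task metric viewed as a function of the activation on a single edge, $e_{\text{clean}}$ is that activation on the clean input, and $e_{\text{corr}}$ is the activation on the corrupted input.

```latex
L(e_{\text{corr}}) - L(e_{\text{clean}})
\;\approx\;
(e_{\text{corr}} - e_{\text{clean}})^{\top}
\left.\frac{\partial L}{\partial e}\right|_{e = e_{\text{clean}}}
```

The left-hand side is what activation patching measures with a separate forward pass per edge. The right-hand side needs only the two sets of activations and one gradient, which a single backward pass supplies for all edges at once.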
