Attribution Patching Outperforms Automated Circuit Discovery

Aaquib Syed, Can Rager, Arthur Conmy

TL;DR

The paper tackles scalable mechanistic interpretability by introducing Edge Attribution Patching (EAP), a fast, edge-centric method to identify task-relevant subnetworks in large transformer-like models. By applying a first-order Taylor approximation, EAP estimates edge importance with two forward passes and one backward pass, enabling efficient pruning of the computational graph to recover circuits. Empirical results show EAP often achieves superior or competitive circuit recovery (higher ROC AUC) compared with Activation Patching and ACDC, while demanding far fewer forward evaluations. The authors also show that combining EAP with ACDC on the pruned subgraph can yield further improvements, suggesting a practical two-stage workflow for automated circuit discovery in large models.
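The first-order Taylor approximation mentioned above can be written out as follows. This is a sketch in assumed notation, not taken from the text on this page: $L$ is the task metric, $e_{\text{clean}}$ and $e_{\text{corr}}$ are the activations carried by an edge $E$ on the clean and corrupted inputs, and $\operatorname{do}(E = e)$ denotes overwriting that edge's activation.

```latex
L\bigl(x_{\text{clean}} \mid \operatorname{do}(E = e_{\text{corr}})\bigr) - L(x_{\text{clean}})
\;\approx\;
(e_{\text{corr}} - e_{\text{clean}})^{\top}
\left.\frac{\partial L}{\partial e}\right|_{e = e_{\text{clean}}}
```

Under this reading, one forward pass on the clean input and one on the corrupted input supply the two activations, and a single backward pass on the clean input supplies the gradient for every edge at once, which would account for the stated cost of two forward passes and one backward pass.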

Abstract

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that averaged over all tasks our method has greater AUC from circuit recovery than other methods.
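The linear approximation and pruning step described in the abstract can be illustrated on a toy graph. The sketch below is a minimal pure-Python example under stated assumptions, not the authors' implementation: the three upstream nodes, two downstream nodes, the weights `A`, `B`, `C`, and all function names are hypothetical, and the gradient is written analytically rather than obtained from an autodiff backward pass.

```python
import math

# Hypothetical toy graph (an assumption for illustration only):
# 3 upstream nodes feed 2 downstream nodes through 6 edges.
A = [0.5, -1.0, 0.1]   # upstream node i outputs u_i = A[i] * x
B = [1.0, 0.3]         # downstream node j computes tanh(B[j] * h_j)
C = [2.0, -1.0]        # metric L = sum_j C[j] * tanh(B[j] * h_j)

def upstream(x):
    """Outputs of the upstream nodes for a scalar input x."""
    return [a * x for a in A]

def metric(h):
    """Task metric as a function of the downstream node inputs h."""
    return sum(c * math.tanh(b * hj) for b, c, hj in zip(B, C, h))

def grad_metric(h):
    """Analytic dL/dh_j, standing in for one backward pass."""
    return [c * b * (1.0 - math.tanh(b * hj) ** 2) for b, c, hj in zip(B, C, h)]

def eap_scores(x_clean, x_corr):
    """Linear estimate of the effect of patching each edge (i -> j)."""
    u_clean = upstream(x_clean)           # forward pass on the clean input
    u_corr = upstream(x_corr)             # forward pass on the corrupted input
    h_clean = [sum(u_clean)] * len(B)     # each downstream node sums all upstream outputs
    g = grad_metric(h_clean)              # backward pass at the clean activations
    return {(i, j): (u_corr[i] - u_clean[i]) * g[j]
            for i in range(len(A)) for j in range(len(B))}

def exact_patch_effect(x_clean, x_corr, edge):
    """Activation patching for one edge: rerun the metric with that edge corrupted."""
    i, j = edge
    u_clean, u_corr = upstream(x_clean), upstream(x_corr)
    h = [sum(u_clean)] * len(B)
    base = metric(h)
    h[j] += u_corr[i] - u_clean[i]        # only node j sees the corrupted u_i
    return metric(h) - base

def top_k_edges(scores, k):
    """Keep the k edges with the largest absolute score; the rest are pruned."""
    return sorted(scores, key=lambda e: abs(scores[e]), reverse=True)[:k]

scores = eap_scores(1.0, 1.2)
kept = top_k_edges(scores, 2)
```

In this toy setting, `exact_patch_effect` needs one extra metric evaluation per edge, whereas `eap_scores` produces all six estimates from the same two forward evaluations and one gradient. The estimates are only first-order, so they drift from the exact effect as the gap between clean and corrupted activations grows.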

Paper Structure

This paper contains 19 sections, 3 equations, and 10 figures.

Figures (10)

  • Figure 1: Edge Attribution Patching (EAP)
  • Figure 2: ROC Curves comparing EAP, ACDC with task metric, and ACDC with KL Divergence. The Docstring plot also compares to Activation Patching.
  • Figure 3: Distribution of Attribution Scores for the IOI Task (Logit Diff)
  • Figure 4: Visualizing Edge Attribution Patching.
  • Figure 5: Comparing statistics of the combined EAP + ACDC methods with EAP only. The inset shows a zoom to the significant area of the statistics of the combined method.
  • ...and 5 more figures