Finding Transformer Circuits with Edge Pruning

Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen

TL;DR

Edge Pruning reframes circuit discovery as a gradient-based edge-pruning problem, pruning the connections between Transformer components while preserving model behavior. By employing a disentangled residual stream and L0 regularization on edge masks, it yields highly sparse circuits that remain faithful to full-model predictions, scaling to large models and datasets. The method accurately recovers ground-truth Tracr circuits and demonstrates actionable insights in a CodeLlama-13B case study, revealing overlapping mechanisms across prompting paradigms. Overall, Edge Pruning offers a practical, scalable tool for mechanistic interpretability with broad implications for understanding and auditing large language models.

Abstract

The path to interpreting a language model often proceeds via analysis of circuits -- sparse computational subgraphs of the model that capture specific aspects of its behavior. Recent work has automated the task of discovering circuits. Yet, these methods have practical limitations, as they rely either on inefficient search algorithms or inaccurate approximations. In this paper, we frame automated circuit discovery as an optimization problem and propose *Edge Pruning* as an effective and scalable solution. Edge Pruning leverages gradient-based pruning techniques, but instead of removing neurons or components, it prunes the *edges* between components. Our method finds circuits in GPT-2 that use less than half the number of edges compared to circuits found by previous methods while being equally faithful to the full model predictions on standard circuit-finding tasks. Edge Pruning is efficient even with as many as 100K examples, outperforming previous methods in speed and producing substantially better circuits. It also perfectly recovers the ground-truth circuits in two models compiled with Tracr. Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale that prior methods operate on. We use this setting for a case study comparing the mechanisms behind instruction prompting and in-context learning. We find two circuits with more than 99.96% sparsity that match the performance of the full model and reveal that the mechanisms in the two settings overlap substantially. Our case study shows that Edge Pruning is a practical and scalable tool for interpretability and sheds light on behaviors that only emerge in large models.
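
To make the idea of pruning edges rather than components concrete, the following is a minimal sketch, not the authors' implementation. It assumes a component reads the mask-weighted sum of upstream outputs from a disentangled residual stream, that a pruned edge simply contributes nothing (a simplification), and that masks are discretized at a threshold of 0.5. The names `masked_read`, `discretize`, and `sparsity` are hypothetical.

```python
from typing import List

Vector = List[float]


def masked_read(upstream_outputs: List[Vector], edge_masks: List[float]) -> Vector:
    """Input to one component: the mask-weighted sum of upstream outputs.

    With every mask equal to 1 this is the ordinary residual-stream sum,
    i.e. the full model. Assumption: a masked-out edge contributes zero.
    """
    if len(upstream_outputs) != len(edge_masks):
        raise ValueError("need exactly one mask per upstream output")
    if not upstream_outputs:
        return []
    dim = len(upstream_outputs[0])
    total = [0.0] * dim
    for output, mask in zip(upstream_outputs, edge_masks):
        for k in range(dim):
            total[k] += mask * output[k]
    return total


def discretize(edge_masks: List[float], threshold: float = 0.5) -> List[float]:
    """Round continuous masks to {0, 1}; the threshold is an assumed value."""
    return [1.0 if m > threshold else 0.0 for m in edge_masks]


def sparsity(binary_masks: List[float]) -> float:
    """Fraction of edges that were pruned."""
    return 1.0 - sum(binary_masks) / len(binary_masks)


# Three upstream components writing 2-dimensional outputs.
outputs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
full_model_input = masked_read(outputs, [1.0, 1.0, 1.0])

# Continuous masks as they might look after optimization (made-up values).
circuit_masks = discretize([0.9, 0.1, 0.7])
circuit_input = masked_read(outputs, circuit_masks)
```

In this toy setup, pruning the middle edge removes only that component's contribution to this one reader; the component itself stays available to other readers, which is the distinction between pruning edges and pruning components.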

Paper Structure

This paper contains 39 sections, 11 equations, 16 figures, and 2 tables.

Figures (16)

  • Figure 1: Edge Pruning disentangles the residual stream and optimizes continuous masks on the read operations via gradient descent. Discretizing the masks to $\{0, 1\}$ yields the final circuit. The full model corresponds to the case where all masks equal $1$.
  • Figure 2: The faithfulness of the methods, given the KL divergence between the model and obtained circuits (lower is better). On IOI-t1 and GP, Edge Pruning is competitive at low sparsities and better at high sparsities. It outperforms both ACDC and EAP by a significant margin on IOI and GT.
  • Figure 3: Comparison of circuit performance between methods. We report the Logit Difference $\log P(\text{correct}) - \log P(\text{misleading})$ for IOI-t1, IOI and GP, and the probability difference $P(yy+1:99) - P(00:yy-1)$ for GT. Higher is better for all plots. Edge Pruning finds better-performing circuits on all four tasks. The dashed line indicates the performance of the full model.
  • Figure 4: The canonical ground-truth circuits for the Tracr-compiled xproportion and reverse programs. Edge Pruning recovers both circuits perfectly.
  • Figure 5: Our secondary metric for measuring faithfulness is the Exact Match percentage between the model and circuit predictions on IOI-t1, IOI, and GP. On GT, we use the Kendall's Tau score between the model and circuit rankings of $00, 01, \ldots, 99$ as the secondary metric. Edge Pruning is the most faithful method on all four tasks, with the difference being especially pronounced for IOI.
  • ...and 11 more figures
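
The metrics named in the captions for Figures 2, 3, and 5 can be illustrated with a short sketch. It assumes both the model and the circuit produce a vector of logits per example, and it computes the KL divergence in the direction KL(model || circuit), which the captions do not specify. The function names are hypothetical, and the Kendall's Tau metric used for GT is omitted.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(model_logits: Sequence[float], circuit_logits: Sequence[float]) -> float:
    """KL(model || circuit) over the output distribution (direction assumed)."""
    log_p = log_softmax(model_logits)
    log_q = log_softmax(circuit_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def logit_difference(logits: Sequence[float], correct: int, misleading: int) -> float:
    """log P(correct) - log P(misleading) for one example."""
    log_p = log_softmax(logits)
    return log_p[correct] - log_p[misleading]


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


def exact_match(model_batch: Sequence[Sequence[float]],
                circuit_batch: Sequence[Sequence[float]]) -> float:
    """Percentage of examples where model and circuit predict the same token."""
    matches = sum(argmax(m) == argmax(c) for m, c in zip(model_batch, circuit_batch))
    return 100.0 * matches / len(model_batch)


# Toy logits over a 3-token vocabulary (made-up values).
model_out = [2.0, 0.0, -1.0]
circuit_out = [1.5, 0.5, -1.0]
kl = kl_divergence(model_out, circuit_out)
diff = logit_difference(circuit_out, correct=0, misleading=1)
```

A circuit that reproduces the model's logits exactly has a KL divergence of zero and an Exact Match of 100%, while the logit difference measures task performance independently of how closely the circuit tracks the model.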