Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework

Hao Gu, Vibhas Nair, Amrithaa Ashok Kumar, Jayvart Sharma, Ryan Lagasse

TL;DR

The paper introduces Hybrid Attribution and Pruning (HAP), a two-stage framework that first uses Edge Attribution Patching (EAP) to rapidly identify high-potential edges and then applies Edge Pruning (EP) to extract faithful transformer circuits. By combining the speed of EAP with the precision of EP, HAP achieves at least a 46% reduction in runtime while maintaining comparable faithfulness to full EP, and outperforms EAP in accuracy. In an Indirect Object Identification (IOI) case study on GPT-2 Small, HAP preserves cooperative components such as S-inhibition heads that attribution-only methods tend to prune, demonstrating improved qualitative circuit recovery. The findings suggest that HAP enhances the scalability of mechanistic interpretability to larger models and tasks, providing a practical pathway for deeper circuit discovery with reduced computational cost.

Abstract

Interpreting language models often involves circuit analysis, which aims to identify sparse subnetworks, or circuits, that accomplish specific tasks. Existing circuit discovery algorithms face a fundamental trade-off: attribution patching is fast but unfaithful to the full model, while edge pruning is faithful but computationally expensive. This research proposes a hybrid attribution and pruning (HAP) framework that uses attribution patching to identify a high-potential subgraph, then applies edge pruning to extract a faithful circuit from it. We show that HAP is 46% faster than baseline algorithms without sacrificing circuit faithfulness. Furthermore, we present a case study on the Indirect Object Identification task, showing that our method preserves cooperative circuit components (e.g., S-inhibition heads) that attribution patching methods prune at high sparsity. Our results show that HAP could be an effective approach for improving the scalability of mechanistic interpretability research to larger models. Our code is available at https://anonymous.4open.science/r/HAP-circuit-discovery.
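
The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the edge data are made up, the names `eap_scores`, `top_edges`, `greedy_prune`, and `toy_loss` are hypothetical, and the second stage is a simple greedy ablation loop standing in for Edge Pruning. It only shows the control flow: score edges with a first-order attribution estimate, keep a generous candidate set, then prune that set against a task loss.

```python
import numpy as np

def eap_scores(delta, grad):
    # First-order attribution per edge: |(corrupted - clean activation) * gradient|.
    return np.abs(delta * grad)

def top_edges(scores, keep_fraction):
    # Indices of the highest-scoring edges, best first.
    k = max(1, int(round(keep_fraction * scores.size)))
    return np.argsort(scores)[::-1][:k].tolist()

def greedy_prune(candidates, scores, loss_fn, tolerance=0.0):
    # Greedy stand-in for Edge Pruning (an assumption, not the real method):
    # try removing edges, weakest first, and keep a removal only if the
    # loss stays within `tolerance` of the loss on the full candidate set.
    kept = list(candidates)
    base = loss_fn(kept)
    for edge in sorted(candidates, key=lambda e: scores[e]):
        trial = [e for e in kept if e != edge]
        if loss_fn(trial) - base <= tolerance:
            kept = trial
    return kept

# Toy data (made up): edge 2 has a small score, but the task needs it.
delta = np.array([2.0, -1.5, 0.4, 0.05, -0.3, 0.01])
grad = np.array([1.0, 1.0, 0.5, 1.0, 1.0, 1.0])
REQUIRED = {0, 1, 2}

def toy_loss(edges):
    # Number of required edges missing from the circuit.
    return float(len(REQUIRED - set(edges)))

scores = eap_scores(delta, grad)
eap_only = top_edges(scores, keep_fraction=2 / 6)    # aggressive threshold
candidates = top_edges(scores, keep_fraction=4 / 6)  # generous threshold
hybrid = greedy_prune(candidates, scores, toy_loss)

print(sorted(eap_only))  # [0, 1]
print(sorted(hybrid))    # [0, 1, 2]
```

In this toy setup, thresholding attribution scores aggressively drops the low-scoring but necessary edge 2, while the hybrid route keeps it because the pruning stage checks the task loss directly. This mirrors, in miniature, the abstract's claim about cooperative components; it says nothing about real runtimes or faithfulness.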

Paper Structure

This paper contains 16 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Recovered IOI circuits. While EAP on its own is unable to recover all S-Inhibition Heads at high sparsity, HAP preserves S-Inhibition Heads because it only uses EAP at low sparsity.
  • Figure 2: Attribution score distribution over different EAP thresholds.