EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification

Lin Zhang, Wenshuo Dong, Zhuoran Zhang, Shu Yang, Lijie Hu, Ninghao Liu, Pan Zhou, Di Wang

TL;DR

This work improves gradient-based circuit discovery for mechanistic interpretability of transformer-based language models. It identifies saturation along integration paths as a key flaw in prior methods like EAP and EAP-IG and proposes Edge Attribution Patching with GradPath (EAP-GP), which uses a model-aware GradPath to adaptively integrate gradients and avoid saturated regions. The approach defines a formal GradPath-based attribution score that addresses the limitations of straight-line paths and the zero-gradient problem. It improves circuit faithfulness by up to 17.7% across six tasks on GPT-2 variants, with precision and recall against ground-truth circuits that are competitive with prior approaches. The method makes the identification of critical subgraphs in language models more reliable.

Abstract

Understanding the internal mechanisms of transformer-based language models remains challenging. Mechanistic interpretability based on circuit discovery aims to reverse engineer neural networks by analyzing their internal processes at the level of computational subgraphs. In this paper, we revisit existing gradient-based circuit identification methods and find that their performance is affected by either the zero-gradient problem or saturation effects, in which edge attribution scores become insensitive to input changes, resulting in noisy and unreliable attribution evaluations for circuit components. To address the saturation effect, we propose Edge Attribution Patching with GradPath (EAP-GP). EAP-GP introduces an integration path that starts from the input and adaptively follows the direction of the difference between the gradients of corrupted and clean inputs to avoid the saturated region. This approach enhances attribution reliability and improves the faithfulness of circuit identification. We evaluate EAP-GP on 6 datasets using GPT-2 Small, GPT-2 Medium, and GPT-2 XL. Experimental results demonstrate that EAP-GP outperforms existing methods in circuit faithfulness, achieving improvements of up to 17.7%. Comparisons with manually annotated ground-truth circuits demonstrate that EAP-GP achieves precision and recall comparable to or better than previous approaches, highlighting its effectiveness in identifying accurate circuits.
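
The zero-gradient and saturation problems named in the abstract can be illustrated on a toy scalar function. The sketch below is not the paper's method or model: the steep sigmoid `f`, the inputs `x_clean` and `x_corrupt`, the step count, and the `1e-3` saturation threshold are all assumptions chosen for illustration. It compares a single-point gradient estimate (taken only at the clean input) with an average of gradients along a straight-line path, and counts how many path points fall where the gradient is nearly zero.

```python
import math

def f(x):
    """Toy scalar 'model': a steep sigmoid that saturates away from x = 0."""
    return 1.0 / (1.0 + math.exp(-10.0 * x))

def grad_f(x):
    """Analytic derivative of f."""
    s = f(x)
    return 10.0 * s * (1.0 - s)

def straight_line_gradients(x_clean, x_corrupt, steps):
    """Return (point, gradient) pairs at evenly spaced points on the
    straight line from the corrupted input to the clean input."""
    points = [x_corrupt + (k / steps) * (x_clean - x_corrupt)
              for k in range(1, steps + 1)]
    return [(p, grad_f(p)) for p in points]

# Hypothetical inputs: the clean input sits deep in the flat part of f.
x_clean, x_corrupt = 3.0, -0.5
delta = x_clean - x_corrupt

# Single-point estimate: gradient at the clean input only.
single_point_estimate = delta * grad_f(x_clean)

# Straight-line estimate: average the gradient over the path (a Riemann sum).
samples = straight_line_gradients(x_clean, x_corrupt, steps=7)
path_estimate = delta * sum(g for _, g in samples) / len(samples)

# Path points whose gradient is nearly zero (illustrative threshold).
saturated = [p for p, g in samples if abs(g) < 1e-3]

true_change = f(x_clean) - f(x_corrupt)
print("true change:          ", true_change)
print("single-point estimate:", single_point_estimate)
print("straight-line estimate:", path_estimate)
print("saturated path points:", len(saturated), "of", len(samples))
```

In this toy setting the single-point estimate is essentially zero even though the output changes substantially, and most of the straight-line samples land in the flat region, so only a few points carry the estimate. This is the kind of behavior that motivates choosing the integration path adaptively.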

Paper Structure

This paper contains 13 sections, 14 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of the gradient behavior of the loss along an integration path. EAP uses a single input $x_u$ (the original activation), while EAP-IG and EAP-GP use blended inputs along pre-fixed straight-line paths and gradient-based adjusted paths, respectively. The points on the dashed lines represent the intermediate perturbed inputs along each path.
  • Figure 2: Illustration of the straight-line path and the dynamically adjusted path used in EAP-GP. GradPath starts at the original input $x_u$ and constructs a path in the direction of the steepest gradient descent toward the corrupted activation. The saturated area on the straight-line path is marked in red.
  • Figure 3: Faithfulness of circuits obtained using EAP-GP across different edge sparsity levels and step counts for IOI and gender-bias tasks.
  • Figure 4: Comparison of circuit performance across different methods on GPT-2 Small. In all plots, a higher value indicates better performance. EAP-GP identifies circuits that outperform other methods across all six tasks.
  • Figure 5: Precision-recall curves for IOI (left) and Greater-Than (right) node/edge overlap.
  • ...and 3 more figures
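
The contrast drawn in the Figure 2 caption, between a pre-fixed straight line and a path built by following gradients toward the corrupted activation, can be sketched on a toy problem. This is a minimal illustration under stated assumptions and not the paper's GradPath algorithm: the 2-D `model`, the squared `output_gap` objective, the normalized fixed-size step, and the finite-difference gradient are all hypothetical choices made for this example.

```python
import math

def model(z):
    """Toy 2-D 'model' output; the tanh term saturates for large |z[0]|."""
    return math.tanh(3.0 * z[0]) + 0.5 * z[1]

def output_gap(z, z_corrupt):
    """Squared difference between the outputs at z and at the corrupted input."""
    return (model(z) - model(z_corrupt)) ** 2

def numerical_grad(fn, z, eps=1e-5):
    """Central-difference gradient of fn at z."""
    grad = []
    for i in range(len(z)):
        hi, lo = list(z), list(z)
        hi[i] += eps
        lo[i] -= eps
        grad.append((fn(hi) - fn(lo)) / (2.0 * eps))
    return grad

def gradient_guided_path(z_clean, z_corrupt, steps, step_size):
    """Start at the clean input and take normalized gradient-descent steps
    on the output gap, recording every intermediate point."""
    z = list(z_clean)
    path = [list(z)]
    for _ in range(steps):
        g = numerical_grad(lambda p: output_gap(p, z_corrupt), z)
        norm = math.sqrt(sum(c * c for c in g))
        if norm < 1e-12:  # no usable direction left; stop early
            break
        z = [zi - step_size * gi / norm for zi, gi in zip(z, g)]
        path.append(list(z))
    return path

# Hypothetical inputs: z[0] is saturated at the clean point, z[1] is not.
z_clean, z_corrupt = [2.0, 1.0], [-2.0, -1.0]
path = gradient_guided_path(z_clean, z_corrupt, steps=10, step_size=0.2)
gaps = [output_gap(p, z_corrupt) for p in path]

print("start:", path[0], "gap:", gaps[0])
print("end:  ", path[-1], "gap:", gaps[-1])
```

In this toy case the path moves almost entirely along `z[1]`, the coordinate to which the output is still sensitive, rather than along the saturated `z[0]` coordinate that a straight line to `z_corrupt` would also traverse. The output gap shrinks at every step.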

Theorems & Definitions (3)

  • Definition 3.1: Computational Graph [hanna2024faith]
  • Definition 3.2: Circuit Discovery
  • Definition 4.1: Saturation Effects and Regions