RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching

Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, Neel Nanda

TL;DR

The paper introduces Relevance Patching (RelP), which replaces the local gradients in attribution patching with Layer-wise Relevance Propagation (LRP) coefficients to make mechanistic analysis of transformer language models both faithful and scalable. At the same cost as attribution patching (two forward passes and one backward pass), RelP aligns more closely with activation patching, especially for residual stream and MLP outputs, and it matches Integrated Gradients in faithfulness for sparse feature circuits at lower computational cost. The approach is validated on the Indirect Object Identification task across GPT-2, Pythia, Qwen2, and Gemma2 models, and on Subject–Verb Agreement circuits. The work connects feature attribution methods with mechanistic interpretability, enabling more reliable circuit discovery in language models without prohibitive computation.

Abstract

Activation patching is a standard method in mechanistic interpretability for localizing the components of a model responsible for specific behaviors, but it is computationally expensive to apply at scale. Attribution patching offers a faster, gradient-based approximation, yet suffers from noise and reduced reliability in deep, highly non-linear networks. In this work, we introduce Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). LRP propagates the network's output backward through the layers, redistributing relevance to lower-level components according to local propagation rules that ensure properties such as relevance conservation or improved signal-to-noise ratio. Like attribution patching, RelP requires only two forward passes and one backward pass, maintaining computational efficiency while improving faithfulness. We validate RelP across a range of models and tasks, showing that it more accurately approximates activation patching than standard attribution patching, particularly when analyzing residual stream and MLP outputs in the Indirect Object Identification (IOI) task. For instance, for MLP outputs in GPT-2 Large, attribution patching achieves a Pearson correlation of 0.006, whereas RelP reaches 0.956, highlighting the improvement offered by RelP. Additionally, we compare the faithfulness of sparse feature circuits identified by RelP and Integrated Gradients (IG), showing that RelP achieves comparable faithfulness without the extra computational cost associated with IG.
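To make the relationship between activation patching and its gradient-based approximation concrete, the following is a minimal sketch on a toy model, not the paper's implementation. It assumes a hypothetical scalar metric `m = v . act(W @ h)`, patches each component of `h` from a clean run into a corrupted run, and compares the exact effect with the first-order estimate `(h_clean - h_corr) * dm/dh`. The names `forward`, `activation_patching`, and `attribution_patching`, the toy model, and the patching direction are all assumptions made for illustration. RelP would swap the gradient term for LRP-derived propagation coefficients, which this sketch does not implement.

```python
import numpy as np

def forward(h, W, v, act):
    """Toy model (assumed): scalar metric m = v . act(W @ h)."""
    return float(v @ act(W @ h))

def activation_patching(h_clean, h_corr, W, v, act):
    """Exact effect of patching one component at a time (one forward pass each)."""
    base = forward(h_corr, W, v, act)
    effects = np.zeros_like(h_corr)
    for i in range(h_corr.size):
        h = h_corr.copy()
        h[i] = h_clean[i]  # overwrite a single component with its clean value
        effects[i] = forward(h, W, v, act) - base
    return effects

def attribution_patching(h_clean, h_corr, W, v, act_grad):
    """First-order estimate using the gradient taken on the corrupted run."""
    grad = W.T @ (v * act_grad(W @ h_corr))  # dm/dh by the chain rule
    return (h_clean - h_corr) * grad

# Hypothetical random data standing in for cached clean/corrupted activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
v = rng.normal(size=8)
h_clean = rng.normal(size=6)
h_corr = rng.normal(size=6)

exact = activation_patching(h_clean, h_corr, W, v, np.tanh)
approx = attribution_patching(
    h_clean, h_corr, W, v, lambda z: 1.0 - np.tanh(z) ** 2
)

# Pearson correlation between the exact and approximate per-component effects.
pcc = np.corrcoef(exact, approx)[0, 1]
print(f"PCC between activation and attribution patching: {pcc:.3f}")
```

In this sketch the approximation is exact when `act` is linear and degrades as the nonlinearity saturates, which mirrors the abstract's point that local gradients become less reliable in deep, highly non-linear networks.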

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Pearson correlation coefficient (PCC) between activation patching and attribution patching (AtP) or relevance patching (RelP), computed over 100 IOI prompts for three GPT-2 model sizes (Small, Medium, Large), two Pythia models (70M, 410M), two Qwen2 models (0.5B, 7B), and Gemma2-2B. A higher value of PCC represents higher alignment with activation patching results.
  • Figure 2: Qualitative comparison showing how accurately relevance patching (RelP) and attribution patching (AtP) approximate the effects of activation patching in GPT-2 Small. RelP shows notably better alignment in the residual stream and at MLP0, where AtP's estimates are less reliable.
  • Figure 3: Faithfulness and completeness scores for circuits, evaluated on held-out data. Faint lines show individual circuits for structures from Table \ref{tab:other_templates}, while the bold lines indicate the average across all structures. An ideal circuit has a faithfulness score of 1 and a completeness score of 0. While Integrated Gradients (IG) requires multiple integration steps (steps=10 in this experiment), RelP achieves comparable faithfulness scores without any additional computational cost.
  • Figure 4: Qualitative comparison showing how accurately relevance patching (RelP) and attribution patching (AtP) approximate the effects of activation patching in GPT-2 Large. RelP shows notably better alignment in the residual stream and at MLP0, where AtP's estimates are less reliable.
  • Figure 5: Qualitative comparison showing how accurately relevance patching (RelP) and attribution patching (AtP) approximate the effects of activation patching in Pythia-410M. RelP shows notably better alignment in the residual stream and at MLP0, where AtP's estimates are less reliable.