Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

Fred Zhang, Neel Nanda

TL;DR

Activation patching is a central tool for mechanistic interpretability, but prior work shows results can be highly sensitive to methodological choices. This paper systematically investigates how corruption methods (Gaussian noise, GN, vs. symmetric token replacement, STR), evaluation metrics (probability, logit difference, KL divergence), and sliding-window patching affect localization and circuit discovery in language models. It finds that GN and STR can produce divergent localization patterns, that the choice of metric can either obscure or reveal negative components, and that sliding-window patching amplifies joint effects. Based on these findings, the authors recommend favoring STR because it keeps perturbations in-distribution, using logit difference as the primary metric, and applying sliding-window patching judiciously, in order to improve the reliability and interpretability of activation-patching analyses.

Abstract

Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters could lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for the best practices of activation patching going forwards.
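To make the three evaluation metrics named above concrete, here is a minimal sketch of how each could be computed from a model's output logits at the final position. The function names, the toy logits, and the direction of the KL divergence (clean distribution relative to patched distribution) are assumptions made for illustration; this summary does not specify the paper's exact definitions or normalization.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def probability(logits, correct):
    """Probability assigned to the correct token."""
    return math.exp(log_softmax(logits)[correct])

def logit_difference(logits, correct, wrong):
    """Logit of the correct token minus logit of a chosen incorrect token."""
    return logits[correct] - logits[wrong]

def kl_divergence(clean_logits, patched_logits):
    """KL(P_clean || P_patched); the direction is an assumption of this sketch."""
    lp = log_softmax(clean_logits)
    lq = log_softmax(patched_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

# Hypothetical logits over a 3-token vocabulary (illustrative values only).
clean = [2.0, 1.0, 0.0]
patched = [1.0, 1.5, 0.0]

print(probability(patched, correct=0))
print(logit_difference(patched, correct=0, wrong=1))
print(kl_divergence(clean, patched))
```

Logit difference is linear in the logits and can be negative, so a component that lowers the correct answer's logit shows up as a negative value. Probability is bounded to [0, 1], which compresses such changes, and this is one way a metric can obscure negative components.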

Paper Structure (67 sections, 4 equations, 31 figures, 5 tables)

Figures (31)

  • Figure 1: The workflow of activation patching for localization: run the intervention procedure (a) on every relevant component, such as all the attention heads, and plot the effects (b).
  • Figure 2: Disparate MLP patching effects for factual recall in GPT-2 XL. (a) We patch MLP activations at the last subject token. (b)(c) The patching effects using different corruption methods with a window size of $5$. STR suggests a much weaker peak, regardless of the evaluation metric.
  • Figure 3: Attention of the Name Movers from the last token, in corrupted and patched runs.
  • Figure 4: Activation patching on MLP across layers and token positions in GPT-2 XL, with sliding window patching of size $5$. Note that probability (b) highlights the importance of the last subject token, whereas logit difference (a) displays weaker effects.
  • Figure 5: Sliding window patching vs summing up individual patching effects; patching MLP activation at the last subject token in GPT-2 XL on factual recall prompts. Sliding window patching yields peak values $1.40$×, $1.75$× and $1.59$× those of the summation of single-layer patching effects. Single-layer patching (a) suggests a weak peak.
  • ...and 26 more figures
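
The workflow in Figure 1 and the comparison in Figure 5 can be illustrated with a toy sketch: cache activations from a clean run, re-run on the corrupted input while substituting the cached activation of one or more layers, and measure how far the output moves back. The scalar "model" below, its weights, and its inputs are all hypothetical and bear no relation to GPT-2 XL; the sketch only shows why patching a window of layers jointly can give a larger effect than summing single-layer effects when later layers are nonlinear.

```python
def run(x, weights, patch=None):
    """Toy residual model: h <- h + relu(w * h) at each layer.

    `patch` maps a layer index to an activation that overrides that
    layer's own output (this is the activation-patching step).
    Returns the final output and the per-layer activations.
    """
    h = x
    acts = []
    for i, w in enumerate(weights):
        a = max(0.0, w * h)          # the layer's own activation
        if patch and i in patch:
            a = patch[i]             # overwrite with the cached clean activation
        acts.append(a)
        h = h + a
    return h, acts

# Hypothetical weights and inputs, chosen only for illustration.
weights = [0.5, 0.5, 0.5]
x_clean, x_corrupted = 1.0, -1.0

clean_out, clean_acts = run(x_clean, weights)
corrupted_out, _ = run(x_corrupted, weights)

def patch_effect(layers):
    """Change in output when the given layers are patched with clean activations."""
    patch = {i: clean_acts[i] for i in layers}
    patched_out, _ = run(x_corrupted, weights, patch)
    return patched_out - corrupted_out

single_sum = patch_effect([0]) + patch_effect([1])   # sum of single-layer effects
window = patch_effect([0, 1])                        # joint (window) patching
print(single_sum, window)
```

In this toy, patching layers 0 and 1 together pushes the residual value above zero, which lets the unpatched layer 2 contribute as well, so the window effect exceeds the sum of the two single-layer effects.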