Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG

Maxime Méloux, François Portet, Maxime Peyrard

TL;DR

This work reframes mechanistic interpretability as statistical inference and uses a stability-focused analysis of Edge Attribution Patching with Integrated Gradients (EAP-IG) to quantify how discovered circuits vary under data resampling, paraphrasing, hyperparameter changes, and causal-intervention noise. Across three tasks and three model families, the study finds high structural variance and sensitivity to methodological choices, challenging the robustness of single-circuit claims. It proposes best practices for MI research, including routine reporting of stability metrics and robustness checks, and argues for moving toward a probabilistic view of circuits rather than seeking a single 'true' circuit. The results highlight the need for statistically grounded rigor in MI to improve reproducibility and reliability in explanations of neural network behavior.

Abstract

The development of trustworthy artificial intelligence requires moving beyond black-box performance metrics toward an understanding of models' internal computations. Mechanistic Interpretability (MI) aims to meet this need by identifying the algorithmic mechanisms underlying model behaviors. Yet, the scientific rigor of MI critically depends on the reliability of its findings. In this work, we argue that interpretability methods, such as circuit discovery, should be viewed as statistical estimators, subject to questions of variance and robustness. To illustrate this statistical framing, we present a systematic stability analysis of a state-of-the-art circuit discovery method: EAP-IG. We evaluate its variance and robustness through a comprehensive suite of controlled perturbations, including input resampling, prompt paraphrasing, hyperparameter variation, and injected noise within the causal analysis itself. Across a diverse set of models and tasks, our results demonstrate that EAP-IG exhibits high structural variance and sensitivity to hyperparameters, questioning the stability of its findings. Based on these results, we offer a set of best-practice recommendations for the field, advocating for the routine reporting of stability metrics to promote a more rigorous and statistically grounded science of interpretability.

Paper Structure

This paper contains 16 sections, 5 figures, 9 tables.

Figures (5)

  • Figure 1: In gpt2-small, varying multiple circuit finding parameters at once (resampling strategy, aggregation method, type of intervention, EAP method, and pruning strategy) yields many different circuits, which we display along with the union and median circuit (left). In the center, the MDS projection of the pairwise Jaccard index matrix shows that none of the tested EAP methods consistently yields circuits with lower variance (tighter clustering).
  • Figure 2: Circuit error and pairwise Jaccard index of EAP-IG circuits found across the three models, tasks, and types of perturbation. One point represents one circuit.
  • Figure 3: Average and standard deviation of the circuit error (left) and pairwise Jaccard index (right) of the circuits found in gpt2-small when using noise with amplitude [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5] as intervention.
  • Figure 4: Full heatmap of the pairwise Jaccard index between circuits displayed in Figure 1 (circuits found in gpt2-small on the Greater-Than task while varying all parameters).
  • Figure 5: Coefficient of variation (CV) of circuit metrics for different noise amplitudes in gpt2-small, averaged across tasks.
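
The captions above rely on the pairwise Jaccard index between circuits and on a CV of circuit metrics. As a rough illustration of how such stability metrics could be computed, here is a minimal Python sketch. It assumes each circuit is represented as a set of directed edges, that CV denotes the coefficient of variation (standard deviation divided by mean), and that the population standard deviation is used. The edge names and the helper functions `jaccard`, `pairwise_jaccard`, and `coefficient_of_variation` are hypothetical and are not taken from the paper's code.

```python
from itertools import combinations
from statistics import mean, pstdev


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two edge sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen here: two empty circuits count as identical.
        return 1.0
    return len(a & b) / len(union)


def pairwise_jaccard(circuits):
    """Jaccard index for every unordered pair of circuits."""
    return [jaccard(a, b) for a, b in combinations(circuits, 2)]


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed CV definition)."""
    m = mean(values)
    return pstdev(values) / m if m != 0 else float("nan")


# Toy circuits with hypothetical edge names, standing in for circuits
# obtained from repeated runs under some perturbation (e.g. resampled inputs).
circuits = [
    {("a0", "m1"), ("m1", "out")},
    {("a0", "m1"), ("a2", "out")},
    {("a0", "m1"), ("m1", "out"), ("a2", "out")},
]

scores = pairwise_jaccard(circuits)
print("pairwise Jaccard:", scores)
print("mean Jaccard:", mean(scores))
print("CV of Jaccard:", coefficient_of_variation(scores))
```

For these three toy circuits the pairwise scores are 1/3, 2/3, and 2/3. A mean close to 1 with a low CV would indicate that repeated runs recover nearly the same edges; lower means indicate structural instability of the kind the figures report.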