Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG
Maxime Méloux, François Portet, Maxime Peyrard
TL;DR
This work reframes mechanistic interpretability (MI) as statistical inference and uses a stability-focused analysis of Edge Attribution Patching with Integrated Gradients (EAP-IG) to quantify how discovered circuits vary under data resampling, prompt paraphrasing, hyperparameter changes, and noise injected into the causal interventions. Across three tasks and three model families, the study finds high structural variance and sensitivity to methodological choices, challenging the robustness of single-circuit claims. It proposes best practices for MI research, including routine reporting of stability metrics and robustness checks, and argues for moving toward a probabilistic view of circuits rather than seeking a single 'true' circuit. The results highlight the need for statistically grounded rigor in MI to improve the reproducibility and reliability of explanations of neural network behavior.
Abstract
The development of trustworthy artificial intelligence requires moving beyond black-box performance metrics toward an understanding of models' internal computations. Mechanistic Interpretability (MI) aims to meet this need by identifying the algorithmic mechanisms underlying model behaviors. Yet the scientific rigor of MI critically depends on the reliability of its findings. In this work, we argue that interpretability methods, such as circuit discovery, should be viewed as statistical estimators, subject to questions of variance and robustness. To illustrate this statistical framing, we present a systematic stability analysis of a state-of-the-art circuit discovery method: EAP-IG. We evaluate its variance and robustness through a comprehensive suite of controlled perturbations, including input resampling, prompt paraphrasing, hyperparameter variation, and injected noise within the causal analysis itself. Across a diverse set of models and tasks, our results demonstrate that EAP-IG exhibits high structural variance and sensitivity to hyperparameters, calling into question the stability of its findings. Based on these results, we offer a set of best-practice recommendations for the field, advocating for the routine reporting of stability metrics to promote a more rigorous and statistically grounded science of interpretability.
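To make the "circuit discovery as a statistical estimator" framing concrete, the following is a minimal sketch of how structural variance under input resampling might be measured. It is not the authors' implementation: the mean pairwise Jaccard index is assumed here as one plausible stability metric, and `top_k_circuit` is a hypothetical toy stand-in for a real circuit discovery method such as EAP-IG (it simply keeps the k edges with the largest mean absolute score).

```python
import itertools
import random


def jaccard(a, b):
    """Jaccard index of two edge sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def top_k_circuit(scores, k):
    """Hypothetical stand-in for circuit discovery.

    `scores` is a list of per-example rows, one attribution score per edge.
    The "circuit" is the set of k edge indices with the largest mean |score|.
    """
    n_edges = len(scores[0])
    mean_abs = [
        sum(abs(row[e]) for row in scores) / len(scores) for e in range(n_edges)
    ]
    ranked = sorted(range(n_edges), key=lambda e: mean_abs[e], reverse=True)
    return frozenset(ranked[:k])


def bootstrap_stability(scores, k, n_boot=20, seed=0):
    """Mean pairwise Jaccard index between circuits found on bootstrap resamples."""
    if n_boot < 2:
        raise ValueError("n_boot must be at least 2 to form pairs")
    rng = random.Random(seed)
    circuits = []
    for _ in range(n_boot):
        # Resample examples with replacement, then rerun the estimator.
        sample = [rng.choice(scores) for _ in scores]
        circuits.append(top_k_circuit(sample, k))
    pairs = list(itertools.combinations(circuits, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Synthetic per-example edge scores (assumed data, for illustration only).
    rng = random.Random(1)
    toy_scores = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(30)]
    print(bootstrap_stability(toy_scores, k=10))
```

A value near 1 would indicate that resampling the inputs barely changes the selected edges, while a value near 0 would indicate that the discovered circuit depends heavily on which examples were drawn. The same loop could be reused for the other perturbations by swapping the resampling step for paraphrased prompts or varied hyperparameters.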
