The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis

Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov

TL;DR

This work addresses the lack of unity in interpretability by reframing mechanistic interpretability through causal mediation analysis. It proposes a taxonomy of mediator types (from neurons and heads to non-basis-aligned subspaces) and maps how search methods (exhaustive, supervised, unsupervised, alignment) interact with these mediators. The paper introduces evaluation criteria (sparsity, generality, selectivity, faithfulness) aligned with three goals (explaining, verifying hypotheses, localization/editing) and offers actionable future directions, including discovering new mediators and standardized benchmarks. Overall, it provides a structured, causality-grounded framework to compare MI studies, guiding method selection based on research objectives and enabling principled progress in understanding neural network computations.

Abstract

Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the basic causal units underlying these mechanisms are often not explicitly defined. In this article, we propose a perspective on interpretability research grounded in causal mediation analysis. Specifically, we describe the history and current state of interpretability taxonomized according to the types of causal units (mediators) employed, as well as methods used to search over mediators. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate. We argue that this framing yields a more cohesive narrative of the field and helps researchers select appropriate methods based on their research objective. Our analysis yields actionable recommendations for future work, including the discovery of new mediators and the development of standardized evaluations tailored to these goals.

Paper Structure (52 sections, 3 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Outline of survey. We first define necessary causal terminology (§\ref{sec:preliminaries}) and contextualize our perspective with others' (§\ref{sec:related_work}). We then give an overview of the history of mechanistic interpretability centered on units of causal analysis (§\ref{sec:history}). We then survey and categorize commonly used units of analysis and describe their strengths and weaknesses (§\ref{sec:mediator_type}), as well as methods for searching over them (§\ref{sec:mediator_search}). Finally, we discuss (§\ref{sec:discussion}) what we consider to be among the most important questions in mechanistic interpretability: What are the right causal abstractions for understanding and discussing the inner workings of NNs (§\ref{sec:right_mediator})? What kinds of mediators and research will be needed to advance the field (§\ref{sec:future_work})?
  • Figure 2: Visual summary of causal mediation analysis. Given input $X=\color{blue}{x}$ and the resulting output (model prediction) $Y=\color{blue}{y}$, and another input $X=\color{red}{x'}$ that results in different output $Y=\color{red}{y'}$, we can compute the total effect of changing $\color{blue}{x}$ to $\color{red}{x'}$ as $\color{red}{y'} \color{black}{-} \color{blue}{y}$. In neural networks, there exist components $Z$ that mediate the influence of $X$ on $Y$. A common way to quantify the importance of $Z$ is by measuring its indirect effect (Eq. \ref{eq:ie}), where, given $X=\color{blue}{x}$, one sets $Z$ to some counterfactual value $\color{red}{z'}$. In this figure, we set $Z$ to what it would have been given $\color{red}{x'}$; this results in $Y=\color{darkgray}{\tilde{y}}$. One can then measure the indirect effect as $\color{darkgray}{\tilde{y}}\color{black}{-}\color{blue}{y}$.
  • Figure 3: Visualization of common mediator types in neural networks. Neurons or attention heads are common units of analysis. Full layer and submodule vectors are more coarse-grained, but more easily enumerable. One can also implicate a multidimensional subspace, which could be neuron-basis-aligned (as in a group of neurons, pictured here) or non-basis-aligned. Non-basis-aligned mediators (e.g., arbitrary directions in activation space) have recently become a popular mediator type due to their monosemanticity. However, discovering non-basis-aligned mediators requires external modules such as classifiers, autoencoders, or other modifications to the original computation graph. Note that while this figure depicts a Transformer, many of the mediator types generalize to other architectures (the primary exception being attention heads).
  • Figure 4: Neurons are not guaranteed to encode interpretable features. If non-basis-aligned directions encode the features of interest, then a neuron may activate on many different features that are non-orthogonal to its basis. Locating non-basis-aligned mediators requires components in addition to the model's computation graph that encode the coefficients on each activation. One can, for example, obtain these coefficients via supervised optimization with probing classifiers (§\ref{sssec:supervised}) or unsupervised optimization with sparse autoencoders (§\ref{sssec:unsupervised}). Note that optimization-based techniques sometimes introduce non-linearities, meaning that the discovered directions will not necessarily be a subspace of activation space.
  • Figure 5: Example of alignment search, based on an example from Mueller et al. (2025). (a) We start with the computation graph $\mathcal{C}$ and a hypothesized high-level causal graph $\mathcal{H}$. The hypothesis is that the model accomplishes addition using a tens-place addition, a ones-place addition, and a carry-the-one variable. (b) We hypothesize that the carry-the-one variable exists in layer two ($\mathbf{h}^2$). This variable may be distributed across multiple neurons, so interventions on individual neurons will not suffice. (c) We learn a rotation $\mathbf{R}$ into a new space where the target variable is aligned to the basis. This allows us to perform an intervention (the do-operation) to change the carry-the-one variable to some counterfactual value. (d) After intervening, we rotate back out using $\mathbf{R}^{-1}$. If the hypothesized causal graph is correct, the new output should be 11 instead of 21 after changing the carry-the-one variable's value.
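
The quantities named in the Figure 2 caption (total effect $y' - y$ and indirect effect $\tilde{y} - y$), together with the rotate, intervene, rotate-back step from the Figure 5 caption, can be illustrated on a toy network. The following is a minimal sketch under stated assumptions, not the paper's implementation: the two-layer model, the weights `W1` and `w2`, the inputs, and all function names are hypothetical, and a random orthogonal matrix stands in for the learned rotation $\mathbf{R}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model (not from the paper): one hidden layer, scalar output.
W1 = rng.normal(size=(4, 3))   # input (3) -> hidden (4)
w2 = rng.normal(size=4)        # hidden (4) -> scalar output


def hidden(x):
    """Hidden activations h = tanh(W1 x); these play the role of mediators Z."""
    return np.tanh(W1 @ x)


def readout(h):
    """Output Y as a function of the hidden activations."""
    return float(w2 @ h)


def total_effect(x, x_prime):
    """y' - y: the change in output when input x is replaced by x'."""
    return readout(hidden(x_prime)) - readout(hidden(x))


def indirect_effect_neurons(x, x_prime, idx):
    """y~ - y, where the neurons in `idx` are set to their values under x'."""
    h = hidden(x)
    h_patched = h.copy()
    h_patched[idx] = hidden(x_prime)[idx]
    return readout(h_patched) - readout(h)


def indirect_effect_rotated(x, x_prime, R, k):
    """y~ - y for a non-basis-aligned mediator.

    Rotate h with an orthogonal matrix R, overwrite coordinate k with the
    value it takes under x', then rotate back with R^{-1} = R.T.
    """
    h = hidden(x)
    r = R @ h
    r[k] = (R @ hidden(x_prime))[k]
    return readout(R.T @ r) - readout(h)


x = np.array([1.0, 0.0, -1.0])        # base input
x_prime = np.array([0.0, 1.0, 1.0])   # counterfactual input

te = total_effect(x, x_prime)
per_neuron = [indirect_effect_neurons(x, x_prime, [j]) for j in range(4)]
all_neurons = indirect_effect_neurons(x, x_prime, [0, 1, 2, 3])

# Assumption: a random orthogonal matrix (via QR) stands in for a learned rotation.
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))
per_direction = [indirect_effect_rotated(x, x_prime, R, k) for k in range(4)]

print("total effect:", te)
print("per-neuron indirect effects:", per_neuron)
print("per-direction indirect effects:", per_direction)
```

Patching every hidden unit at once recovers the total effect here, because all influence of the input on the output passes through the hidden layer. The per-neuron and per-direction indirect effects also sum to the total effect, but only because this toy readout is linear; that additivity should not be expected in real networks.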