Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?

Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard

TL;DR

This paper questions whether mechanistic interpretability explanations for neural networks are uniquely determined when fixed validity criteria are imposed. It formalizes computational abstractions and analyzes two MI strategies (where-then-what and what-then-where) through exhaustive experiments on small MLPs and XOR tasks, revealing systematic non-identifiability across circuits, interpretations, algorithms, and subspaces. The findings challenge the assumption of a unique, canonical explanation and discuss pragmatic alternatives, potential resolutions via causal abstraction and inner interpretability frameworks, and the need for multi-criteria validation. The work highlights fundamental limits of identifiability in MI and calls for more rigorous, multi-faceted criteria to define trustworthy explanations for AI systems.

Abstract

As AI systems are used in high-stakes applications, ensuring interpretability is crucial. Mechanistic Interpretability (MI) aims to reverse-engineer neural networks by extracting human-understandable algorithms to explain their behavior. This work examines a key question: for a given behavior, and under MI's criteria, does a unique explanation exist? Drawing on identifiability in statistics, where parameters are uniquely inferred under specific assumptions, we explore the identifiability of MI explanations. We identify two main MI strategies: (1) "where-then-what," which isolates a circuit replicating model behavior before interpreting it, and (2) "what-then-where," which starts with candidate algorithms and searches for neural activation subspaces implementing them, using causal alignment. We test both strategies on Boolean functions and small multi-layer perceptrons, fully enumerating candidate explanations. Our experiments reveal systematic non-identifiability: multiple circuits can replicate behavior, a circuit can have multiple interpretations, several algorithms can align with the network, and one algorithm can align with different subspaces. Is uniqueness necessary? A pragmatic approach may require only predictive and manipulability standards. If uniqueness is essential for understanding, stricter criteria may be needed. We also reference the inner interpretability framework, which validates explanations through multiple criteria. This work contributes to defining explanation standards in AI.

Paper Structure

This paper contains 31 sections, 3 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Illustration of the computational abstraction components within a neural network. The circuit represents a subgraph, and the mapping specifies the high-level features computed by the circuit, detailing how their values arise from low-level neural activations. Together, these form the computational abstraction (explanation of the neural network). Here, feature $F_2$ has three possible values and is defined within the 2D activation space of two neurons. Features $F_0$ and $F_1$ are binary variables, each assigned to a single neuron. $F_0$ covers the entire activation space and $F_1$ only maps specific intervals, leaving some activations unassigned.
  • Figure 2: Illustration of identifiability problems using the XOR example. We train a small MLP with two hidden layers of size 3 to compute the XOR function perfectly. The figure shows the outcome of stress-testing the two reverse-engineering strategies: Top: For the what-then-where strategy, we enumerate all subsets of neurons searching for subsets causally aligned with intermediate variables of candidate algorithms, with alignment measured by IIA. Even testing only two candidate algorithms, we find perfect implementations of both in the model. Multiple mappings (localizations) for each algorithm were identified, showing that neither the algorithm (what) nor its location in the network (where) is unique. Bottom: For the where-then-what strategy, we enumerate circuits (sub-networks) and test whether each computes the XOR independently. For each circuit, we search for possible feature interpretations of the selected neurons, identifying intermediate logic gates whose values can be mapped consistently with the neurons' activations. Consistency is defined as in Definition 3 (Consistent Mapping). We find many different perfect circuits (the where is not unique) and for any given circuit, we find multiple valid interpretations (the what is not unique).
  • Figure 3: Number of computational abstractions found in the circuit-first approach (circuit interpretations) and the algorithm-first approach (perfect minimal mappings), as a function of architecture size $k$ (left) or of the number $n$ of gates the model is trained on (right, averaged over all target gates). One point per neural network.
  • Figure 4: The 12 most sparse circuits found in the example network.
  • Figure 5: Total number of interpretations found in the circuit-first approach and of mappings found in the algorithm-first approach, grouped by target gate.
  • ...and 4 more figures
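
The circuit enumeration described in the Figure 2 caption can be illustrated with a toy sketch. The network below is hand-built, not the trained MLP from the paper: the weights `W1`, `b1`, `w2`, the 0.25 decision threshold, and the use of zero-ablation to define a circuit are all assumptions made for illustration, and the paper's own circuit definition (Definition 1) may differ. The sketch enumerates every subset of hidden neurons and keeps those that still compute XOR on all four inputs.

```python
from itertools import combinations

import numpy as np

# All four Boolean inputs and the XOR targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 0])

# Hypothetical hand-built one-hidden-layer ReLU network (not from the paper).
# Hidden units: relu(a - b), relu(b - a), relu(a + b), relu(a + b - 1).
W1 = np.array([[1, -1], [-1, 1], [1, 1], [1, 1]], dtype=float)
b1 = np.array([0, 0, 0, -1], dtype=float)
w2 = np.array([0.5, 0.5, 0.5, -1.0])
N_HIDDEN = 4


def predict(keep):
    """Run the network with every hidden neuron outside `keep` zero-ablated."""
    mask = np.array([i in keep for i in range(N_HIDDEN)], dtype=float)
    hidden = np.maximum(X @ W1.T + b1, 0.0) * mask
    return (hidden @ w2 > 0.25).astype(int)


# Exhaustively enumerate neuron subsets and keep those that reproduce XOR.
perfect_circuits = [
    subset
    for size in range(N_HIDDEN + 1)
    for subset in combinations(range(N_HIDDEN), size)
    if np.array_equal(predict(subset), Y)
]

# The smallest perfect circuits: several distinct ones can coexist.
min_size = min(len(c) for c in perfect_circuits)
minimal_circuits = [c for c in perfect_circuits if len(c) == min_size]

print("perfect circuits:", perfect_circuits)
print("minimal circuits:", minimal_circuits)
```

In this toy network, more than one subset passes the test, and two disjoint pairs of neurons are both minimal, so even sparsity does not single out one circuit. This mirrors, at a much smaller scale, the "where is not unique" observation in the caption.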

Theorems & Definitions (6)

  • Definition 1: Circuit
  • Definition 2: Mapping ($\tau$)
  • Definition 3: Consistent Mapping
  • Definition 4: Circuit Error
  • Definition 5: Intervention interchange
  • Definition 6: IIA
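
As a rough illustration of how interchange interventions and IIA (Definitions 5 and 6) are typically computed, the sketch below follows the usual causal-abstraction recipe: run the network on a base input, overwrite one neuron with its activation on a source input, and check whether the output matches a high-level algorithm under the same swap of its intermediate variable. The exact definitions from the paper are not reproduced in this listing, so the two-neuron network, the candidate algorithm, and the function names `hidden`, `readout`, `low_level_swap`, `high_level_swap`, and `iia` are assumptions.

```python
from itertools import product

INPUTS = list(product([0, 1], repeat=2))


def hidden(x):
    """Hypothetical hidden layer: [relu(a + b), relu(a + b - 1)]."""
    a, b = x
    return [max(a + b, 0), max(a + b - 1, 0)]


def readout(h):
    """Output unit: fires when h[0] - 2 * h[1] exceeds 0.5 (computes XOR)."""
    return int(h[0] - 2 * h[1] > 0.5)


def low_level_swap(base, source, neuron):
    """Run on `base`, but take `neuron`'s activation from the `source` run."""
    h = hidden(base)
    h[neuron] = hidden(source)[neuron]
    return readout(h)


def high_level_swap(base, source):
    """Candidate algorithm: S = a AND b; out = (a OR b) AND NOT S.

    The intermediate variable S is computed on `source`, the rest on `base`.
    """
    s = source[0] & source[1]
    return int((base[0] | base[1]) and not s)


def iia(neuron):
    """Fraction of (base, source) pairs where both interventions agree."""
    pairs = list(product(INPUTS, INPUTS))
    hits = sum(
        low_level_swap(b, s, neuron) == high_level_swap(b, s) for b, s in pairs
    )
    return hits / len(pairs)


print("S aligned with neuron 1:", iia(1))
print("S aligned with neuron 0:", iia(0))
```

Under these assumptions, aligning the variable S with neuron 1 gives perfect agreement on all 16 pairs, while aligning it with neuron 0 does not. The what-then-where search described above amounts to repeating this check over many candidate algorithms and neuron subsets.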