Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard

TL;DR

The paper develops causal abstraction as a general theoretical foundation for mechanistic interpretability, introducing intervention algebras and a spectrum of exact and approximate transformations between causal models. It unifies a wide range of interpretability methods under a common language of interventions, translations, and alignments, and it supports graded faithfulness through approximate abstractions. Through concrete examples (tree-structured algorithms, neural networks, and infinite-variable bubble sort), it demonstrates how high-level macrovariables can faithfully reflect low-level microvariables via bijective translations and constructive abstractions. It further connects behavioral explanations (LIME, integrated gradients) and patching/scrubbing techniques to this framework, and outlines practical strategies for learning modular features and steering model behavior. The central contribution is a formal account of when and how high-level abstractions faithfully capture the causal structure of complex models.

Abstract

Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known, but opaque low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) providing a flexible, yet precise formalization for the core concepts of polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness, and (3) unifying a variety of mechanistic interpretability methods in the common language of causal abstraction, namely, activation and path patching, causal mediation analysis, causal scrubbing, causal tracing, circuit analysis, concept erasure, sparse autoencoders, differential binary masking, distributed alignment search, and steering.

Paper Structure (47 sections, 5 theorems, 63 equations, 5 figures, 1 table, 1 algorithm)

Key Result

Lemma 19

For $\alpha,\beta \in A^*$, we have $\alpha \approx \beta$ iff $\mathsf{Sort}(\mathsf{Collapse}(\alpha)) = \mathsf{Sort}(\mathsf{Collapse}(\beta))$, that is, iff $\alpha$ and $\beta$ have the same normal form.
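
The normal-form test in Lemma 19 can be illustrated with a small sketch. The definitions of $\mathsf{Collapse}$ and $\mathsf{Sort}$ are not reproduced in this summary, so the code below rests on assumptions: each element of $A$ is taken to be a hard intervention encoded as a `(variable, value)` pair, `collapse` is assumed to keep only the last assignment to each variable (later interventions overwrite earlier ones), and `sort_by_variable` is assumed to order the survivors by variable name (interventions on distinct variables commute). All function names are hypothetical.

```python
from typing import Hashable, List, Tuple

# Illustrative encoding (an assumption): one intervention = (variable, value).
Intervention = Tuple[str, Hashable]


def collapse(seq: List[Intervention]) -> List[Intervention]:
    """Keep only the last assignment to each variable, in original order."""
    last_index = {var: i for i, (var, _) in enumerate(seq)}
    return [iv for i, iv in enumerate(seq) if last_index[iv[0]] == i]


def sort_by_variable(seq: List[Intervention]) -> List[Intervention]:
    """Order interventions by variable name (unique per variable after collapse)."""
    return sorted(seq, key=lambda iv: iv[0])


def normal_form(seq: List[Intervention]) -> List[Intervention]:
    """Sort(Collapse(seq)) under the assumptions stated above."""
    return sort_by_variable(collapse(seq))


def equivalent(alpha: List[Intervention], beta: List[Intervention]) -> bool:
    """Two sequences are treated as equivalent iff their normal forms match."""
    return normal_form(alpha) == normal_form(beta)


alpha = [("X", 1), ("Y", 2), ("X", 3)]  # X is set twice; the later value wins
beta = [("Y", 2), ("X", 3)]             # same net effect, different order
gamma = [("X", 1), ("Y", 2)]            # different final value for X
```

Under these assumptions, `alpha` and `beta` share a normal form and `gamma` does not, which is the kind of distinction the lemma's criterion is meant to capture.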

Figures (5)

  • Figure 1: A tree-structured algorithm that perfectly solves the hierarchical equality task with a compositional solution.
  • Figure 2: A fully-connected feed-forward neural network that labels inputs for the hierarchical equality task. The weights of the network are handcrafted to implement the tree-structured solution to the task.
  • Figure 3: The result of aligned interchange intervention on the low-level fully-connected neural network and a high-level tree-structured algorithm under the alignment in Figure 2. Observe the equivalent counterfactual behavior across the two levels.
  • Figure 4: An illustration of a fully-connected neural network being transformed into a tree-structured algorithm by (1) marginalizing away neurons aligned with no high-level variable, (2) merging sets of variables aligned with high-level variables, and (3) merging the continuous values of neural activity into the symbolic values of the algorithm.
  • Figure 5: Abstractions of the bubble sort causal model.
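
The tree-structured algorithm of Figure 1 and the interchange intervention of Figure 3 can be sketched at the high level as follows. This is a minimal illustration, not the paper's implementation. It assumes the hierarchical equality task takes four inputs `(a, b, c, d)` and labels them by whether `a == b` and `c == d` have the same truth value. The intermediate variable names `V1` and `V2` and both function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Inputs = Tuple[int, int, int, int]


def tree_algorithm(a: int, b: int, c: int, d: int,
                   overrides: Optional[Dict[str, bool]] = None) -> Dict[str, bool]:
    """Run the high-level model; `overrides` fixes intermediate variables."""
    overrides = overrides or {}
    v1 = overrides.get("V1", a == b)   # left subtree: are the first two equal?
    v2 = overrides.get("V2", c == d)   # right subtree: are the last two equal?
    return {"V1": v1, "V2": v2, "out": v1 == v2}


def interchange_intervention(base: Inputs, source: Inputs, variable: str) -> bool:
    """Run on `base`, but set `variable` to the value it takes on `source`."""
    source_value = tree_algorithm(*source)[variable]
    return tree_algorithm(*base, overrides={variable: source_value})["out"]


base = (1, 1, 2, 2)     # V1 = True,  V2 = True  -> output True
source = (1, 2, 3, 3)   # V1 = False, V2 = True  -> output False
counterfactual = interchange_intervention(base, source, "V1")
```

Here the base input alone yields `True`, while patching in the source's `V1` flips the output to `False`. In the paper's setting, the low-level network counts as aligned with this algorithm when the corresponding intervention on its hidden units produces the same counterfactual output.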

Theorems & Definitions (58)

  • Remark 1: Notation throughout the paper
  • Definition 2: Signature
  • Definition 3: Partial and Total Settings
  • Definition 4: Projection
  • Definition 5
  • Remark 6: Inducing Graphical Structure
  • Remark 7: Acyclic Model Notation
  • Definition 8: Solution Sets
  • Definition 9: Intervention
  • Definition 10: Soft Intervention
  • ...and 48 more