The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

Denis Sutter, Julian Minder, Thomas Hofmann, Tiago Pimentel

TL;DR

This work critically assesses causal abstraction as a tool for mechanistic interpretability. It shows that, under mild assumptions, allowing arbitrarily powerful (in particular non-linear) alignment maps lets any deep neural network (DNN) be perfectly aligned with any algorithm, which makes the approach vacuous unless constraints are placed on how information is encoded. The authors validate this non-linear representation dilemma empirically with distributed alignment search (DAS) on two tasks, hierarchical equality and indirect object identification (IOI), the latter using Pythia models; non-linear maps reach near-perfect interchange intervention accuracy (IIA) even on randomly initialised networks. They discuss the implications for interpretability methods, highlighting the need to impose information-encoding assumptions (e.g., linear vs non-linear encoding) and to consider generalisation when learning alignment maps. The paper concludes that causal abstraction, in its unrestricted form, cannot by itself provide principled mechanistic insight, and that future work should explore how representation encoding interacts with causal abstraction to yield robust interpretations.

Abstract

The concept of causal abstraction has recently been popularised as a way to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., in an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed on alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps' complexity and accuracy. Together, these results suggest an answer to our title's question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.
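
To make the notion of interchange-intervention accuracy concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a hypothetical two-dimensional "network" that stores two equality features in a basis rotated by `THETA`, and a linear alignment map given by a rotation whose angle is fixed by hand (DAS would learn such a map rather than fix it). All names (`hidden`, `readout`, `algorithm`, `iia`) are illustrative assumptions.

```python
import itertools
import math


def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


THETA = 1.0  # hypothetical angle at which the toy network encodes its features


def hidden(x):
    """Toy hidden state: the two equality features, stored in a rotated basis."""
    a, b, c, d = x
    feats = (float(a == b), float(c == d))
    return rotate(feats, -THETA)


def readout(h):
    """Toy output layer: undo the rotation, then compare the two features."""
    z = rotate(h, THETA)
    return (z[0] > 0.5) == (z[1] > 0.5)


def algorithm(x, left=None):
    """High-level algorithm; `left` optionally overrides the node (a == b)."""
    a, b, c, d = x
    left = (a == b) if left is None else left
    return left == (c == d)


def iia(phi_angle):
    """Interchange-intervention accuracy for the linear map phi = rotation."""
    inputs = list(itertools.product(range(2), repeat=4))
    hits = 0
    for base, source in itertools.product(inputs, repeat=2):
        zb = rotate(hidden(base), phi_angle)    # phi(h_base)
        zs = rotate(hidden(source), phi_angle)  # phi(h_source)
        # Swap in the source value of coordinate 0, then map back with phi^{-1}.
        patched = rotate((zs[0], zb[1]), -phi_angle)
        target = algorithm(base, left=(source[0] == source[1]))
        hits += readout(patched) == target
    return hits / len(inputs) ** 2


print(iia(THETA), iia(0.0))
```

In this toy setting, the map aligned with the network's encoding (`iia(THETA)`) matches the algorithm on every base/source pair, while the misaligned identity map (`iia(0.0)`) does not. The dilemma described above arises once the family of admissible maps is widened beyond such linear transformations.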

Paper Structure

This paper contains 72 sections, 17 theorems, 89 equations, 17 figures, 1 table.

Key Result

Theorem 1

Given any algorithm ${\color{JungleGreen}\mathtt{A}}$ and any neural network ${\color{purple}\mathtt{N}}$ such that assumption:injectivity, assumption:Countable_infinite, assumption:surjectivety, assumption:minimalrequirements, and assumption:dnnalgsolve hold, we can show that ${\color{JungleGreen}\mathtt{A}}$

Figures (17)

  • Figure 1: A visualisation of what happens when analysing causal abstractions with increasingly complex alignment maps $\phi$. The more complex $\phi$ is, the higher the intervention accuracy---and, consequently, the stronger the algorithm--DNN alignment. In Theorem 1, we show that given arbitrarily complex alignment maps, we can always find a perfect alignment (under reasonable assumptions).
  • Figure 2: IIA in the hierarchical equality task for causal abstractions trained with different alignment maps $\phi$. The figure shows results for all three algorithms analysed for this task. The bars represent the max IIA across 10 runs with different random seeds. The black lines represent mean IIA with 95% confidence intervals. $\lvert {\color{purple}\boldsymbol{\psi}}^{\phi}_{\color{JungleGreen}\eta} \rvert$ denotes the intervention size per node. Without interventions, all DNNs reach almost perfect accuracy (>0.99). The $\phi^{\mathtt{nonlin}}$ map uses $L_{\mathrm{rn}}=10$ and $d_{\mathrm{rn}}=16$.
  • Figure 3: IIA of the alignment between the "both equality relations" algorithm and an MLP, with interventions at layer 1. Left: Mean IIA over 5 seeds using $\phi^{\mathtt{nonlin}}$ ($L_{\mathrm{rn}}=1$) on the trained DNN. Performance improves with larger hidden dimension $d_{\mathrm{rn}}$ and intervention size $\lvert {\color{purple}\boldsymbol{\psi}}^{\phi}_{\color{JungleGreen}\eta} \rvert$. Right: Maximum IIA across 5 seeds using $\phi^{\mathtt{lin}}$ and $\phi^{\mathtt{nonlin}}$ with $\lvert {\color{purple}\boldsymbol{\psi}}^{\phi}_{\color{JungleGreen}\eta} \rvert=8$. Complex alignment maps achieve high IIA even with randomly initialised DNNs, while simpler maps gradually improve as training progresses.
  • Figure 4: IIA of alignment between ABAB-ABBA algorithm and Pythia language models. Left: IIA across model sizes at initialisation (Init.) or after full training (Full), with intervention at the middle layer. Right: IIA with increasingly complex alignment maps during Pythia-410m's training. Results show complex alignment maps yield near-perfect IIA. All $\phi^{\mathtt{nonlin}}$ use $d_{\mathrm{rn}}=64$.
  • Figure 5: Pseudo-code implementation of an algorithm with interventions, where interventions $\mathbf{I}_{{\color{JungleGreen}\mathtt{A}}}$ are specified as a Python dictionary mapping nodes to their intervened values.
  • ...and 12 more figures
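
The idea in the Figure 5 caption, an algorithm whose interventions are given as a Python dictionary mapping nodes to intervened values, can be sketched as follows. This is an illustrative reconstruction and not the paper's pseudo-code: the node names (`eq_left`, `eq_right`, `out`) are hypothetical, and the task definition assumed here is that hierarchical equality asks whether two input pairs agree on being equal.

```python
from typing import Dict, Optional, Tuple

Inputs = Tuple[int, int, int, int]


def hierarchical_equality(
    x: Inputs, interventions: Optional[Dict[str, bool]] = None
) -> Dict[str, bool]:
    """Run the algorithm, overriding any node listed in `interventions`."""
    forced = interventions or {}
    a, b, c, d = x
    # Each node takes its intervened value if one is given, else it is computed.
    eq_left = forced.get("eq_left", a == b)
    eq_right = forced.get("eq_right", c == d)
    out = forced.get("out", eq_left == eq_right)
    return {"eq_left": eq_left, "eq_right": eq_right, "out": out}


def interchange(base: Inputs, source: Inputs, node: str) -> bool:
    """Run `base` with `node` set to the value it takes on `source`."""
    source_value = hierarchical_equality(source)[node]
    return hierarchical_equality(base, {node: source_value})["out"]


base, source = (1, 1, 2, 3), (4, 5, 6, 7)
print(hierarchical_equality(base)["out"])    # output without intervention
print(interchange(base, source, "eq_left"))  # output after the interchange
```

Here the base input has an equal left pair and an unequal right pair, so the unintervened output is `False`; replacing `eq_left` with its value on the source input (an unequal pair) makes both nodes agree and flips the output to `True`.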

Theorems & Definitions (44)

  • Definition 1: from Beckers2018AbstractingCM
  • Definition 2: from Beckers2018AbstractingCM
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7: inspired by Geiger2023FindingAB
  • Theorem 1
  • proof
  • Theorem 1
  • ...and 34 more