The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Denis Sutter, Julian Minder, Thomas Hofmann, Tiago Pimentel
TL;DR
This work critically assesses causal abstraction as a tool for mechanistic interpretability. It shows that, under mild assumptions, allowing arbitrarily powerful (in particular non-linear) alignment maps lets any DNN align perfectly with any algorithm, which makes the approach vacuous unless constraints are placed on how information is encoded. The authors validate this non-linear representation dilemma empirically with distributed alignment search (DAS) on two tasks, hierarchical equality and indirect object identification (IOI) using Pythia models, and demonstrate near-perfect interchange intervention accuracy (IIA) with non-linear maps even on randomly initialised networks. They discuss the implications for interpretability methods, highlighting the need to impose information-encoding assumptions (e.g., linear vs non-linear encoding) and to consider generalisation when learning alignment maps. The paper concludes that causal abstraction, in its unrestricted form, cannot by itself provide principled mechanistic insight, and that future work should explore how representation encoding interacts with causal abstraction to yield robust interpretations.
Abstract
The concept of causal abstraction has recently been popularised as a way to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., in an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed on alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps' complexity and accuracy. Together, these results suggest an answer to our title's question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.
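To make interchange-intervention accuracy concrete, the following is a minimal toy sketch, not the authors' experimental setup. It assumes a hand-built two-stage "network" whose hidden state stores a high-level variable V = (x1 == x2) and the input x3 in a rotated basis. All names (`high_level`, `encode`, `decode`, `iia`) and the rotation angle are hypothetical choices made for illustration. The sketch compares two linear alignment maps: one that undoes the rotation and the identity map.

```python
import itertools
import numpy as np

# Hypothetical toy algorithm: V = (x1 == x2); output = V AND x3.
def high_level(x, v_override=None):
    v = int(x[0] == x[1]) if v_override is None else v_override
    return int(v and x[2])

# Assumed rotation that "hides" V and x3 in a mixed basis of the hidden state.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def encode(x):
    # Lower half of the toy network: hidden state is a rotation of (V, x3).
    return R @ np.array([float(x[0] == x[1]), float(x[2])])

def decode(h):
    # Upper half of the toy network: undo the rotation, then compute the AND.
    v, w = R.T @ h
    return int(v > 0.5 and w > 0.5)

def iia(phi, phi_inv, k=1):
    # Interchange-intervention accuracy over all (base, source) input pairs.
    # phi maps hidden states into the aligned space; the first k coordinates
    # there are treated as the representation of the high-level variable V.
    inputs = list(itertools.product([0, 1], repeat=3))
    hits = 0
    for base, source in itertools.product(inputs, repeat=2):
        zb, zs = phi(encode(base)), phi(encode(source))
        zb[:k] = zs[:k]                      # patch V from source into base
        low = decode(phi_inv(zb))            # network output after the patch
        high = high_level(base, v_override=int(source[0] == source[1]))
        hits += int(low == high)
    return hits / len(inputs) ** 2

# Alignment map that undoes the rotation versus the identity map.
iia_aligned = iia(lambda h: R.T @ h, lambda z: R @ z)
iia_identity = iia(lambda h: h.copy(), lambda z: z)
print("IIA with rotation-aware map:", iia_aligned)
print("IIA with identity map:", iia_identity)
```

In this toy, the map that undoes the rotation reproduces the algorithm's behaviour on every intervention, while the identity map does not. This is the sense in which IIA depends on the family of alignment maps one is allowed to search over. The sketch uses only linear maps and does not reproduce the paper's non-linear results.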
