A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
Zhengxuan Wu, Atticus Geiger, Jing Huang, Aryaman Arora, Thomas Icard, Christopher Potts, Noah D. Goodman
TL;DR
The paper challenges Makelov et al.'s claim that subspace interchange interventions produce interpretability illusions, arguing that distributed subspace interventions reveal legitimate aspects of neural representations rather than spurious tricks. It formalizes the illusion concept by decomposing interventions into two geometric components (nullspace and rowspace) and argues that, because data-induced submanifolds are generally not orthogonal to downstream components, illusion-like effects are inevitable in practice. Through toy examples and a critique of the IOI and Factual Recall experiments, the authors contend that current evaluation paradigms can misattribute causal structure and that the reported illusions can reflect artifacts of training or evaluation design. The work also clarifies the geometry underlying distributed representations and motivates more robust metrics and broader exploration of DAS and Boundless DAS in mechanistic interpretability.
Abstract
We respond to the recent paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause "interpretability illusions". We first review Makelov et al. (2023)'s technical notion of what an "interpretability illusion" is, and then we show that even intuitive and desirable explanations can qualify as illusions in this sense. As a result, their method of discovering "illusions" can reject explanations they consider "non-illusory". We then argue that the illusions Makelov et al. (2023) see in practice are artifacts of their training and evaluation paradigms. We close by emphasizing that, though we disagree with their core characterization, Makelov et al. (2023)'s examples and discussion have undoubtedly pushed the field of interpretability forward.
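To make the objects under discussion concrete, the following is a minimal NumPy sketch of a one-dimensional subspace interchange intervention and of the rowspace/nullspace split mentioned in the TL;DR. It is an illustration under stated assumptions, not the authors' or Makelov et al.'s implementation: the function names `interchange_1d` and `split_row_null`, the random downstream map `W`, and all dimensions are hypothetical, and the downstream computation is assumed to be a single linear map.

```python
import numpy as np

def interchange_1d(h_base, h_source, v):
    """Overwrite h_base's coordinate along direction v with h_source's coordinate."""
    u = v / np.linalg.norm(v)                      # unit direction of the subspace
    return h_base + ((h_source - h_base) @ u) * u  # everything orthogonal to u is unchanged

def split_row_null(W, v):
    """Split v into its components in the rowspace and the nullspace of W."""
    P_row = np.linalg.pinv(W) @ W                  # orthogonal projector onto rowspace(W)
    v_row = P_row @ v
    v_null = v - v_row                             # W @ v_null is (numerically) zero
    return v_row, v_null

# Hypothetical setup: a 4-d hidden state read by a linear map into 2 dimensions.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))                        # assumed downstream linear map
v = rng.normal(size=4)                             # assumed intervention direction
h_base, h_source = rng.normal(size=4), rng.normal(size=4)

v_row, v_null = split_row_null(W, v)
patched = interchange_1d(h_base, h_source, v)

# Only the rowspace component of v can change what W reads from the hidden state.
downstream_change = W @ (patched - h_base)
```

In this linear toy setting, `downstream_change` depends only on `v_row`, while `v_null` still influences which coordinate is copied from the source activation; this is the two-component geometry that the discussion of "illusions" turns on.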
