A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments

Zhengxuan Wu; Atticus Geiger; Jing Huang; Aryaman Arora; Thomas Icard; Christopher Potts; Noah D. Goodman

TL;DR

The paper disputes Makelov et al. (2023)'s claims of interpretability illusions, arguing that distributed subspace interventions reveal legitimate aspects of neural representations rather than spurious tricks. It formalizes the illusion concept through a two-component geometry of interventions (a nullspace component and a rowspace component) and shows that non-orthogonality between data-induced submanifolds and downstream components makes illusion-like effects inevitable in practice. Using toy examples and a critique of the IOI and factual recall experiments, the authors contend that current evaluation paradigms can misattribute causal structure and that the reported illusions can reflect artifacts of training or evaluation design. The work also clarifies the geometry underlying distributed representations and motivates more robust metrics and broader exploration of DAS and Boundless DAS in mechanistic interpretability.
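The nullspace/rowspace geometry mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the downstream matrix `W_out`, the direction `v`, the activations, and the helper names `split_direction` and `interchange_1d` are all hypothetical, and the one-dimensional patch is only one simple form a subspace interchange intervention can take.

```python
import numpy as np

def split_direction(v, W_out):
    """Split v into rowspace and nullspace parts with respect to W_out.

    Assumes the downstream layer computes W_out @ h, with W_out of shape
    (d_out, d_hidden). Both the layer and this helper are illustrative.
    """
    # pinv(W_out) @ W_out is the orthogonal projector onto the rowspace of W_out.
    P_row = np.linalg.pinv(W_out) @ W_out
    v_row = P_row @ v
    v_null = v - v_row  # the part W_out cannot see
    return v_row, v_null

def interchange_1d(h_base, h_source, v):
    """Overwrite the coordinate of h_base along v with the one from h_source."""
    u = v / np.linalg.norm(v)
    return h_base + (u @ h_source - u @ h_base) * u

rng = np.random.default_rng(0)
W_out = rng.normal(size=(2, 4))   # hypothetical downstream map: 4 hidden -> 2 outputs
v = rng.normal(size=4)            # hypothetical intervention direction
h_base, h_source = rng.normal(size=4), rng.normal(size=4)

v_row, v_null = split_direction(v, W_out)
h_patched = interchange_1d(h_base, h_source, v)

# The nullspace part has no direct effect on the downstream output.
print(np.allclose(W_out @ v_null, 0.0))
```

In this toy setting, a direction with a large `v_null` component changes the hidden state in a way `W_out` cannot read directly, while only `v_row` reaches the output. That is the split the illusion argument turns on.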

Abstract

We respond to the recent paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause "interpretability illusions". We first review Makelov et al. (2023)'s technical notion of what an "interpretability illusion" is, and then we show that even intuitive and desirable explanations can qualify as illusions in this sense. As a result, their method of discovering "illusions" can reject explanations they consider "non-illusory". We then argue that the illusions Makelov et al. (2023) see in practice are artifacts of their training and evaluation paradigms. We close by emphasizing that, though we disagree with their core characterization, Makelov et al. (2023)'s examples and discussion have undoubtedly pushed the field of interpretability forward.

Paper Structure (29 sections, 20 equations, 14 figures)

Introduction
Background: Defining Makelov et al. (2023)'s "Illusion"
Set-up
Nullspace Decomposition
The "Illusion"
Revisiting Makelov et al. (2023)'s Toy Example
Set-up
An Obvious Non-Illusory Direction
A Broader Lesson from the Example above
Remarks on Makelov et al. (2023)'s Experimental Evidence for Discovering "Illusions" in the Wild
Background: The Indirect Object Identification (IOI) and Factual Recall Tasks
Indirect Object Identification (IOI)
Factual Recall
Interchange Intervention Accuracy (IIA)
Checking Dormant or Disconnected Components via Correlational Analysis of Activations
...and 14 more sections

Figures (14)

Figure 1: An illustration of aligning a high-level causal model with key intervention locations of the streams on top of the last input token in the GPT-2 model. Besides aligning the main residual streams and the MLP activations, we align other streams to study how the name position information emerges in the GPT-2 model. The head mixing layer is a linear layer.
Figure 2: Interchange Intervention Accuracy (IIA) when aligning the name position variable with different intervention locations in the main residual streams ($v_{\text{block\_out}}$) as well as the MLP activations ($v_{\text{mlp\_act}}$) above the last token and the four tokens preceding it. The GPT-2 model achieves 96% task accuracy. Higher IIA means better alignment. Overall, our results are consistent with Makelov et al. (2023)'s findings that name position information mainly resides above the last token at the 8th layer.
Figure 3: Interchange Intervention Accuracy (IIA) when aligning the name position variable with different intervention locations above the last token.
Figure 4: Interchange Intervention Accuracy (IIA) when aligning the name position variable with head representations. The top panel shows IIA when aligning with a concatenated representation of all heads in the 8th layer, leaving one head out at a time. The bottom panel shows IIA when aligning with cumulative head representations, starting from the head whose removal causes the largest drop in the top panel and concatenating one additional head at a time in the sorted order of IIA drops from the top panel.
Figure 5: Distributions of learned DAS weights (single-dimensional DAS) when aligning with the name position information at the attention value output stream of the 8th layer. The number of non-zero entries maps well to the head importance discovered through our ablation studies in Section \ref{Sec:head_distribute_reprs} as well as to findings from previous work (Wang et al. 2022).
...and 9 more figures
