Investigating the Indirect Object Identification circuit in Mamba

Danielle Ensign, Adrià Garriga-Alonso

TL;DR

The paper explores whether circuit-based mechanistic interpretability techniques generalize to the Mamba architecture by studying the Indirect Object Identification (IOI) task. It combines manual ablations and semi-automatic discovery (ACDC/EAP) to identify a bottleneck in Layer 39, provides evidence that convolutions in Layer 39 shift name tokens one position forward, and finds linear, position-dependent name representations in the Layer 39 SSM. Positional EAP is introduced to enable token-level edge attributions, strengthening the link between recovered circuits and IOI performance. Overall, the work provides initial evidence that circuit-based interpretability tools transfer to Mamba and offers a blueprint for analyzing IOI-like tasks in recurrent state-space models.

Abstract

How well will current interpretability techniques generalize to future models? A relevant case study is Mamba, a recent recurrent architecture with scaling comparable to Transformers. We adapt pre-Mamba techniques to Mamba and partially reverse-engineer the circuit responsible for the Indirect Object Identification (IOI) task. Our techniques provide evidence that 1) Layer 39 is a key bottleneck, 2) convolutions in Layer 39 shift names one position forward, and 3) the name entities are stored linearly in Layer 39's SSM. Finally, we adapt an automatic circuit discovery tool, positional Edge Attribution Patching, to identify a Mamba IOI circuit. Our contributions provide initial evidence that circuit-based mechanistic interpretability tools work well for the Mamba architecture.

Paper Structure (37 sections, 7 equations, 31 figures)

Figures (31)

  • Figure 1: Our hypothesis for the role of Layer 39. The representations of n1–n3 and n4–n5 are interchangeable over positions.
  • Figure 2: A single layer in the Mamba architecture, with hook points listed in all the locations where we intervene. Note that the SSM contains further hook points, described in Section \ref{overwritename}. The "SSM" and "conv" components are affected by previous time steps.
  • Figure 3: Fully connected causal graph, using the additivity of the residual stream. This is an example network with 4 layers, so the output node is blocks.3.hook_resid_post. The full network we study has 48 layers, so the output node is blocks.47.hook_resid_post.
  • Figure 4: Each panel shows 1 - (normalized logit diff) for each (layer, position) patch, averaged over 80 data points. A value of 0 means the patched model behaves like the uncorrupted forward pass, and 1 means it behaves like the corrupted forward pass. The y-axis is layer and the x-axis is token position. The corruptions can be seen in the token position labels. Each of the five plots corresponds to a different IOI patch.
  • Figure 5: Relative probability of the correct token when zero-ablating each layer's outputs. Relative probability is the softmax over the 4 logits from prompt and corruption names. The clean model gets 83%.
  • ...and 26 more figures
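
The two metrics named in the captions for Figures 4 and 5 can be sketched as follows. This is a minimal illustration, not the authors' code. The exact normalization used in the paper is not shown in this excerpt, so the sketch assumes the common linear rescaling that matches the endpoints stated in the Figure 4 caption (0 for the uncorrupted run, 1 for the corrupted run). It also assumes the logit difference is the correct name's logit minus the incorrect name's logit. All function names and the sample numbers are hypothetical.

```python
import numpy as np


def normalized_logit_diff(patched: float, clean: float, corrupted: float) -> float:
    """Linearly rescale a logit difference: 1 at the clean run, 0 at the corrupted run.

    Assumption: this linear form is inferred from the caption's endpoints,
    not taken from the paper's own definition.
    """
    return (patched - corrupted) / (clean - corrupted)


def patching_effect(patched: float, clean: float, corrupted: float) -> float:
    """1 - normalized logit diff: 0 acts like the clean run, 1 like the corrupted run."""
    return 1.0 - normalized_logit_diff(patched, clean, corrupted)


def relative_probability(name_logits, correct_index: int) -> float:
    """Softmax restricted to the candidate-name logits; returns the mass on the correct name.

    Per the Figure 5 caption, the candidates are the 4 names from the
    prompt and its corruption.
    """
    z = np.asarray(name_logits, dtype=float)
    z = z - z.max()  # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs[correct_index])


if __name__ == "__main__":
    # Hypothetical logit differences, for illustration only.
    clean_ld, corrupted_ld, patched_ld = 4.0, -2.0, 1.0
    print(patching_effect(patched_ld, clean_ld, corrupted_ld))

    # Hypothetical logits for the 4 candidate names; index 0 is the correct one.
    print(relative_probability([3.0, 1.0, 0.5, 0.0], correct_index=0))
```

Restricting the softmax to the candidate names makes the score insensitive to probability mass that an ablation moves onto unrelated tokens, which is presumably why the caption calls it a relative probability.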