An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L
Jett Janiak, Can Rager, James Dao, Yeu-Tong Lau
TL;DR
This work provides concrete evidence for a memory-management phenomenon in transformers, in which certain attention heads erase residual stream directions written by earlier components, freeing up bandwidth. The authors define erasure and the projection ratio, identify writing and erasing heads in a GELU-4L model, and show that layer-2 erasing heads remove a large portion of the signal written by early heads. They show that direct logit attribution (DLA) can be misleading in the presence of erasure: the writing and erasing heads' direct contributions to the logits are strongly anti-correlated, and in adversarial examples a high DLA score does not correspond to a causal effect on the prediction. The study urges caution when interpreting DLA results and recommends combining DLA with activation patching and context manipulation to separate direct contributions from downstream erasure. The findings concern a small transformer but may also be relevant to other architectures.
Abstract
Prior work suggests that language models manage the limited bandwidth of the residual stream through a "memory management" mechanism, where certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can show misleading results by not accounting for erasure.
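To make the failure mode concrete, the following is a minimal toy sketch, not the authors' code and not GELU-4L. It assumes a random unembedding, ignores the final LayerNorm, and hard-codes a hypothetical erasing head that writes -0.9 times an earlier head's output. The `projection_ratio` helper uses an assumed definition, (a · b) / |b|^2; all names and numbers are illustrative.

```python
import random

random.seed(0)
D_MODEL, D_VOCAB = 8, 20  # toy sizes, chosen arbitrarily

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def projection_ratio(a, b):
    # Assumed definition: signed length of a's projection onto b, in units of |b|.
    return dot(a, b) / dot(b, b)

# Hypothetical unembedding: unembed[t] is the direction for token t.
unembed = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_VOCAB)]

def dla(component_out):
    # Direct logit attribution: map one component's output straight to logits
    # (final LayerNorm omitted for simplicity).
    return [dot(component_out, u) for u in unembed]

# An early "writing" head output, and a later head that erases 90% of it.
write_out = [random.gauss(0, 1) for _ in range(D_MODEL)]
erase_out = [-0.9 * x for x in write_out]
resid_final = [w + e for w, e in zip(write_out, erase_out)]

# Pick the token that the writing head's DLA favours most.
logits_write = dla(write_out)
token = max(range(D_VOCAB), key=logits_write.__getitem__)

dla_write = logits_write[token]      # large direct attribution
dla_erase = dla(erase_out)[token]    # opposite sign, nearly cancels it
net_logit = dla(resid_final)[token]  # what actually reaches the output

print(f"projection ratio (erase onto write): {projection_ratio(erase_out, write_out):.2f}")
print(f"DLA write: {dla_write:.3f}  DLA erase: {dla_erase:.3f}  net: {net_logit:.3f}")
```

In this toy setting the writing head's DLA on its own overstates its effect on the final logit by a factor of ten, and the erasing head appears to contribute a large opposing logit even though it only removes earlier output.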
