An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L

Jett Janiak; Can Rager; James Dao; Yeu-Tong Lau

TL;DR

This work provides concrete evidence for a memory-management phenomenon in transformers, in which certain attention heads erase residual stream directions written by earlier components, effectively reallocating bandwidth. The authors define erasure and the projection ratio, identify writing and erasing heads in a GELU-4L model, and demonstrate that layer-2 erasing heads remove a large portion of what early heads write. They show that direct logit attribution (DLA) can be misleading in the presence of erasure, as evidenced by a strong anti-correlation between the DLA contributions of writing and erasure, and by adversarial examples in which a high DLA value does not correspond to a causal effect on the prediction. The study urges caution when interpreting internal attributions and advocates combining DLA with activation patching and context manipulation to separate direct contributions from downstream erasure effects, with implications for interpretability in small transformer models and in broader architectures.

Abstract

Prior work suggests that language models manage the limited bandwidth of the residual stream through a "memory management" mechanism, where certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can show misleading results by not accounting for erasure.
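
To make the failure mode concrete, here is a minimal toy sketch of how DLA can credit a head whose output is later erased. Everything in it is an illustrative assumption rather than something taken from GELU-4L: the 3-dimensional residual stream, the vectors, the helper names, and the simplification that the final layer norm is ignored and the eraser writes the exact negation of what it reads.

```python
# Toy illustration only: all vectors and names are hypothetical, not from GELU-4L.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def add(*vectors):
    """Element-wise sum of several vectors (the residual stream is additive)."""
    return [sum(components) for components in zip(*vectors)]


def direct_logit_attribution(component_output, unembed_direction):
    """DLA of one component: its output dotted with a token's unembedding direction.

    The final layer norm is ignored here to keep the sketch short.
    """
    return dot(component_output, unembed_direction)


def eraser(read_vector):
    """Assumed erasing head: writes the negation of what it reads from the writer."""
    return [-x for x in read_vector]


unembed_direction = [1.0, 0.0, 0.0]  # raises the logit of some token
other_output = [0.3, 0.0, 1.0]       # everything else in the model

# Clean run: the writer writes, the eraser removes it downstream.
writer_output = [2.0, 0.5, 0.0]
eraser_output = eraser(writer_output)
final_logit = dot(add(writer_output, eraser_output, other_output), unembed_direction)

# Ablated run: the writer is zeroed, so the eraser has nothing to remove.
ablated_writer = [0.0, 0.0, 0.0]
ablated_logit = dot(
    add(ablated_writer, eraser(ablated_writer), other_output), unembed_direction
)

writer_dla = direct_logit_attribution(writer_output, unembed_direction)
eraser_dla = direct_logit_attribution(eraser_output, unembed_direction)

print("writer DLA:", writer_dla)
print("eraser DLA:", eraser_dla)
print("causal effect of writer on logit:", final_logit - ablated_logit)
```

In this toy the writer and the eraser receive large DLA values of opposite sign, yet removing the writer leaves the logit unchanged. That is the pattern the abstract warns about: reading the writer's DLA in isolation would overstate its effect on the prediction.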

Paper Structure (15 sections, 1 equation, 5 figures)

This paper contains 15 sections, 1 equation, 5 figures.

Introduction
Methods
Identifying writing components
Identifying erasing components
Investigating causality
Erasure as a potential confounder in DLA interpretation
Verifying DLA predictions through context manipulation
Model architecture and training
Results
Output of head L0H2 is being erased
Layer 2 attention heads are erasing L0H2
Erasure depends on writing
DLA contributions of writing and erasure are highly anti-correlated
Adversarial examples of high DLA values without direct effect
Conclusion

Figures (5)

Figure 1: The output of attention head L0H2 across the residual stream with (green) and without (red) erasure behavior. We show the median projection ratios between residual stream activations and L0H2, with and without V-composition patching. Shaded region represents 25th and 75th quantiles.
Figure 2: Median projection ratios between components in layers 1--3 and head L0H2. Error bars represent 25th and 75th quantiles.
Figure 3: Median projection ratios between selected heads in layer 2 and head L0H2, with and without V-composition patching. Error bars represent 25th and 75th quantiles.
Figure 4: Correlation between the effects of writing and erasure on the logit difference of top 2 model predictions, according to DLA.
Figure 5: Logit difference of top 2 predictions on adversarial examples, according to DLA. Patched refers to replacing the input to the head L0H2 (top) or other heads with high logit difference according to DLA (bottom) with one from a run on unrelated text with the same number of tokens (300 examples). The orange bars show median with error bars at the 25th and 75th quantiles.
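
The captions above report medians and 25th/75th quantiles of projection ratios. The paper's own equation is not reproduced on this page, so the sketch below assumes one natural definition, the signed length of the projection of a vector `a` onto a reference vector `b` in units of the length of `b`, that is `(a · b) / (b · b)`. The function names and the sample values are hypothetical.

```python
import statistics


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def projection_ratio(a, b):
    """Assumed definition: (a . b) / (b . b).

    1.0 means `a` contains `b` in full, 0.0 means `a` is orthogonal to `b`,
    and negative values mean `a` points against `b`.
    """
    return dot(a, b) / dot(b, b)


def summarize(ratios):
    """Median with 25th and 75th quantiles, as reported in the figure captions."""
    q25, median, q75 = statistics.quantiles(ratios, n=4, method="inclusive")
    return {"q25": q25, "median": median, "q75": q75}


# Hypothetical head output and three residual-stream snapshots.
head_output = [1.0, 2.0, 0.0]
residual_snapshots = [
    [1.0, 2.0, 5.0],   # head output fully present (plus unrelated content)
    [0.5, 1.0, 5.0],   # half of it erased
    [0.0, 0.0, 5.0],   # fully erased
]

ratios = [projection_ratio(r, head_output) for r in residual_snapshots]
print(ratios)
print(summarize(ratios))
```

Under this assumed definition, a ratio that starts near 1 right after a head writes and falls toward 0 in later layers would indicate that the head's output is being removed from the residual stream, while a negative ratio for a later component would indicate that the component writes against the head's output.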
