Localizing Paragraph Memorization in Language Models

Niklas Stoehr, Mitchell Gordon, Chiyuan Zhang, Owen Lewis

TL;DR

The paper investigates paragraph-level memorization in the open-weight GPT-Neo 125M model trained on the Pile. It finds that memorization is distributed across layers and components, but that gradients for memorized paragraphs show a distinctive pattern concentrated in lower layers. The authors identify an attention head in layer 1 (L1H2) that appears to be especially involved and that attends predominantly to rare tokens, and they show that memorized continuations can be unlearned with a contrastive objective by sparsely fine-tuning only the highest-gradient weights. Prefix perturbation experiments further show that a few distinctive tokens early in the prefix can corrupt an entire continuation, although memorized paragraphs are on average harder to corrupt and to unlearn than non-memorized ones. These results offer practical starting points for mitigating memorization risks in open-weight models, with potential implications for privacy and safety.

Abstract

Can we localize the weights and mechanisms used by a language model to memorize and recite entire paragraphs of its training data? In this paper, we show that while memorization is spread across multiple layers and model components, gradients of memorized paragraphs have a distinguishable spatial pattern, being larger in lower model layers than gradients of non-memorized examples. Moreover, the memorized examples can be unlearned by fine-tuning only the high-gradient weights. We localize a low-layer attention head that appears to be especially involved in paragraph memorization. This head predominantly focuses its attention on distinctive, rare tokens that are least frequent in a corpus-level unigram distribution. Next, we study how localized memorization is across the tokens in the prefix by perturbing tokens and measuring the resulting change in the decoding. A few distinctive tokens early in a prefix can often corrupt the entire continuation. Overall, memorized continuations are not only harder to unlearn, but also harder to corrupt than non-memorized ones.
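The idea of fine-tuning only the high-gradient weights can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the helper names `top_fraction_mask` and `masked_sgd_step`, the toy weights and gradients, the learning rate, and the plain gradient-descent update are all assumptions made for illustration.

```python
def top_fraction_mask(grads, fraction):
    """Return a 0/1 mask marking the `fraction` of entries with the largest |gradient|."""
    k = max(1, round(len(grads) * fraction))
    # Rank indices by absolute gradient, largest first.
    ranked = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(grads))]


def masked_sgd_step(weights, grads, mask, lr):
    """Apply one gradient-descent step that changes only the masked weights."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


# Toy example (hypothetical values): five weights and the gradients of some
# unlearning loss with respect to them.
weights = [0.5, -1.0, 0.25, 2.0, -0.75]
grads = [0.01, -0.9, 0.02, 0.3, -0.05]

mask = top_fraction_mask(grads, fraction=0.2)  # select 1 of the 5 weights
updated = masked_sgd_step(weights, grads, mask, lr=0.1)
```

Only the weight with the largest absolute gradient moves; every other weight is left untouched, which is what keeps such an update sparse.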

Paper Structure (30 sections, 5 equations, 12 figures, 2 tables)


Figures (12)

  • Figure 1: We interpret language models with respect to their capability to memorize 100-token paragraphs from the training data. Using sets of memorized, non-memorized as well as perturbed memorized paragraphs, we study parameter and activation gradients, activation patterns as well as unlearning and editing objectives to identify an influential "memorization head".
  • Figure 2: Splitting paragraphs of the Pile into memorized paragraphs and non-memorized paragraphs based on GPT-Neo 125M. We present the model with paragraph prefixes of length 50 tokens, greedily decode the next 50 tokens and evaluate the generation in terms of negative log-likelihood (NLL) and exact match (EM).
  • Figure 3: [top] The plot shows the effect of perturbing tokens in the prefix (shown) on the model's generation (not shown) in terms of the negative log-likelihood (NLL) and exact match (EM). Changing the single token "email" into a random other token causes the EM to drop by 45, even though "email" is about 20 tokens before the generated part. [bottom] Perturbing tokens in the memorized paragraphs causes, on average, a smaller drop in the exact match (EM) of the model's generation than perturbing tokens in the non-memorized paragraphs.
  • Figure 4: [top and center] While memorization appears to be spread across multiple layers, we observe systematically different parameter gradients for memorized and non-memorized paragraphs. The former is associated with lower absolute gradients in lower layers of the model. [bottom] Parameter gradient attribution scores for the contrastive objective. The value matrix (W_V) of attention head 2 in layer 1 appears to be strongly involved.
  • Figure 5: [top] To test whether our localization also informs editing, we optimize, based on the contrastive objective, either all model parameters, only the 0.1% of weights with the maximum gradient, or a random sample of weights. The results show that sparsely fine-tuning only the max-gradient weights causes the most unlearning in memorized paragraphs (MPs) and the least in non-memorized paragraphs (NMPs). [bottom] Instead of unlearning MPs, we consider an editing objective that overwrites MPs using perturbed memorized paragraphs (PMPs). While sparse optimization of only the max-gradient weights appears to be about as effective as training all weights, editing is overall more difficult than unlearning.
  • ...and 7 more figures
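
The two evaluation metrics named in the Figure 2 caption, exact match (EM) and negative log-likelihood (NLL), can be sketched in plain Python. This is an illustration, not the authors' code: it assumes EM counts the greedily decoded tokens that agree with the reference continuation up to the first mismatch, and that NLL is averaged over tokens. The helper names and the toy token IDs and probabilities are hypothetical.

```python
import math


def exact_match(generated, reference):
    """Count leading tokens of `generated` that equal `reference` (stop at first mismatch)."""
    n = 0
    for g, r in zip(generated, reference):
        if g != r:
            break
        n += 1
    return n


def mean_nll(token_probs):
    """Average negative log-likelihood, given the model's probability for each reference token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Toy token IDs (hypothetical): the generation diverges at the fourth token.
reference = [11, 42, 7, 7, 3]
generated = [11, 42, 7, 9, 3]

em = exact_match(generated, reference)
nll = mean_nll([0.5, 0.5, 0.5])
```

Under these assumptions, a paragraph whose 50-token continuation is reproduced without a single mismatch would reach the maximum EM of 50, and a lower NLL indicates that the model assigns higher probability to the reference continuation.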