Guess or Recall? Training CNNs to Classify and Localize Memorization in LLMs

Jérémie Dentan, Davide Buscaldi, Sonia Vanier

TL;DR

This paper analyzes verbatim memorization in LLMs by examining attention weight patterns. It critically assesses existing memorization taxonomies and presents a data-driven three-class taxonomy (Non-Memo, Guess, Recall) that aligns more closely with observed attention dynamics. A novel CNN-based methodology benchmarks taxonomy alignment on attention maps, and a custom interpretability pipeline localizes the attention regions responsible for each memorization form. The findings reveal that duplication is a necessary but not qualitatively distinct trigger, Guess relies on lower-layer syntactic cues while Recall depends on short-range high-layer interactions, and the proposed taxonomy provides a robust framework across model sizes. Together, these contributions advance understanding of memorization mechanisms and offer practical guidance for targeted mitigation and interpretability.

Abstract

Verbatim memorization in Large Language Models (LLMs) is a multifaceted phenomenon involving distinct underlying mechanisms. We introduce a novel method to analyze the different forms of memorization described by the existing taxonomy. Specifically, we train Convolutional Neural Networks (CNNs) on the attention weights of the LLM and evaluate the alignment between this taxonomy and the attention weights involved in decoding. We find that the existing taxonomy performs poorly and fails to reflect distinct mechanisms within the attention blocks. We propose a new taxonomy that maximizes alignment with the attention weights, consisting of three categories: memorized samples that are guessed using language modeling abilities, memorized samples that are recalled due to high duplication in the training set, and non-memorized samples. Our results reveal that few-shot verbatim memorization does not correspond to a distinct attention mechanism. We also show that a significant proportion of extractable samples are in fact guessed by the model and should therefore be studied separately. Finally, we develop a custom visual interpretability technique to localize the regions of the attention weights involved in each form of memorization.
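The notion of an "extractable" sample used above can be illustrated with a minimal sketch. It assumes the common definition in which a sequence is k-extractable when greedy decoding from its prefix reproduces its last k tokens verbatim (k = 32 in the figure captions). The `GreedyDecoder` type, `is_k_extractable`, and the toy `make_lookup_decoder` are hypothetical names introduced for illustration only; they are not the authors' code, and the toy decoder merely stands in for a real LLM.

```python
from typing import Callable, List, Sequence

# Hypothetical type: any function mapping (prefix tokens, n) to the n tokens
# that a model decodes greedily after that prefix.
GreedyDecoder = Callable[[Sequence[int], int], List[int]]


def is_k_extractable(tokens: Sequence[int], decode: GreedyDecoder, k: int = 32) -> bool:
    """Return True if the last k tokens are reproduced verbatim from the prefix.

    Assumption: the prefix is everything before the final k tokens.
    """
    if len(tokens) <= k:
        raise ValueError("need at least one prefix token before the k-token suffix")
    prefix, suffix = tokens[:-k], tokens[-k:]
    return list(decode(prefix, k)) == list(suffix)


def make_lookup_decoder(memorized: Sequence[int]) -> GreedyDecoder:
    """Toy stand-in for an LLM: continues `memorized` when the prefix matches it."""
    def decode(prefix: Sequence[int], n: int) -> List[int]:
        start = len(prefix)
        if list(memorized[:start]) == list(prefix):
            return list(memorized[start:start + n])
        return [0] * n  # arbitrary filler for sequences the toy model never saw
    return decode


# A 64-token sample that the toy decoder has "memorized", and one it has not.
sample = list(range(1, 65))
unseen = list(range(100, 164))
decoder = make_lookup_decoder(sample)
seen_is_extractable = is_k_extractable(sample, decoder)
unseen_is_extractable = is_k_extractable(unseen, decoder)
```

This check only separates memorized from non-memorized samples. Whether an extractable sample counts as guessed or recalled is a further question that the sketch does not attempt to answer.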

Paper Structure

This paper contains 40 sections, 3 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: To evaluate a taxonomy of memorized samples, we train CNNs to classify attention weights under this taxonomy. The existing taxonomy yields poor performance. Our new, simpler taxonomy aligns much more closely with the attention mechanism involved in data regurgitation.
  • Figure 2: Comparison of samples' labels between prashanth_recite_2024's taxonomy and ours. The Guess class is broader than Reconstruct: it includes all samples where the suffix is largely predictable from the prefix, which exhibit similar attention weights. We omit non-memorized samples here.
  • Figure 3: Sample attention weights and their corresponding 64-token text snippets. Labels like [Guess$|$Reconstruct] indicate the sample’s class in our taxonomy (left) and in that of prashanth_recite_2024 (right). The intensity of each color in the matrices represents the attention of a different head. The second and fourth samples exhibit similar patterns in lower-layer attention and are both classified as Guess in our taxonomy, though assigned to different classes by prashanth_recite_2024.
  • Figure 4: We parametrize taxonomies as decision trees with two types of nodes. We omit the Non-Memorized node at the root of each taxonomy, because memorized samples are always defined as 32-extractable sequences.
  • Figure 5: Confusion matrices for three taxonomies: prashanth_recite_2024 (left), ours (middle), and the best 4-class taxonomy (right, see Table \ref{tab:taxonomy_results}). Datasets are balanced, with $144{,}000$ attention weights in each class.
  • ...and 13 more figures