Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

USVSN Sai Prashanth, Alvin Deng, Kyle O'Brien, Jyothir S, Mohammad Aflah Khan, Jaydeep Borkar, Christopher A. Choquette-Choo, Jacob Ray Fuehne, Stella Biderman, Tracy Ke, Katherine Lee, Naomi Saphra

TL;DR

This paper reframes LM memorization as a multifaceted phenomenon by proposing a taxonomy that splits memorized data into recitation, reconstruction, and recollection. It defines k-extractable memorization and validates the taxonomy through experiments on deduplicated Pythia models trained on The Pile, using a predictive logistic-regression framework. The study shows that different factors (corpus statistics, sequence properties, perplexity) influence memorization differently across categories, and that taxonomy-aware models outperform homogeneous baselines in predicting memorization. The findings have implications for understanding training dynamics, model safety, and data governance in large language models.

Abstract

Memorization in language models is typically treated as a homogeneous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.
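To make the three categories concrete, here is a minimal sketch of how a memorized sample might be assigned to one of them. It is an illustration under stated assumptions, not the authors' implementation: the duplicate threshold, the order of the checks, the helper names (`is_repeating`, `is_incrementing`, `classify`), and the use of integer values as a toy stand-in for "inherently predictable" sequences are all hypothetical choices made for this example.

```python
from typing import Sequence

# Assumed cutoff for "highly duplicated"; chosen for illustration only.
RECITATION_DUPLICATE_THRESHOLD = 5


def is_repeating(tokens: Sequence[int]) -> bool:
    """True if the sequence repeats a unit whose length is at most half the sequence."""
    n = len(tokens)
    for period in range(1, n // 2 + 1):
        if all(tokens[i] == tokens[i % period] for i in range(n)):
            return True
    return False


def is_incrementing(tokens: Sequence[int]) -> bool:
    """True if consecutive values differ by the same nonzero step (toy proxy)."""
    if len(tokens) < 3:
        return False
    step = tokens[1] - tokens[0]
    return step != 0 and all(b - a == step for a, b in zip(tokens, tokens[1:]))


def classify(tokens: Sequence[int], duplicate_count: int) -> str:
    """Assign a memorized sample to one taxonomy category (hypothetical heuristic)."""
    # Recitation: the sequence appears many times in the training corpus.
    if duplicate_count > RECITATION_DUPLICATE_THRESHOLD:
        return "recitation"
    # Reconstruction: the sequence follows a predictable template.
    if is_repeating(tokens) or is_incrementing(tokens):
        return "reconstruction"
    # Recollection: neither highly duplicated nor inherently predictable.
    return "recollection"
```

For instance, `classify([1, 2, 1, 2, 1, 2], duplicate_count=1)` falls into the reconstruction branch because the sequence repeats with period 2, while the same call with a large `duplicate_count` would be labeled recitation first. Checking duplication before predictability is a design choice of this sketch.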

Paper Structure (43 sections, 2 equations, 12 figures, 4 tables)


Figures (12)

  • Figure 1: Our intuitive memorization taxonomy has three categories determined by simple heuristics.
  • Figure 2: Histograms of various properties of interest (described in Section \ref{sec:factors}) for memorized and unmemorized samples, where the unmemorized statistics are estimated by assuming the representative dataset's statistics hold for the Pile.
  • Figure 3: KL divergence between generation perplexity of memorized and non-memorized examples for Pythia 12B with bootstrapped confidence intervals. Non-memorized samples are treated as the reference distribution. Divergence is highest for sequences with 6 duplicates, while highly duplicated sequences have near-identical memorized and unmemorized distributions.
  • Figure 4: The quantity of memorized data categorized by taxonomy across parameter size and training time. For fully trained models of varying parameter sizes, we give \ref{fig:across_time_scale:count_scale} total counts and \ref{fig:across_time_scale:prop_scale} proportion of memorized samples by category. For the 12B parameter model, we consider intermediate checkpoints during training, also providing for each checkpoint the \ref{fig:across_time_scale:count_time} total memorized counts and \ref{fig:across_time_scale:prop_time} proportion of memorized samples by category. Note that the proportional plots are truncated at 80%, as recitation is consistently a majority of the overall memorized data.
  • Figure 5: Performance of the baseline, proposed-taxonomy, and optimally partitioned models on various metrics, evaluated on subsets of the test dataset. Confidence intervals are standard deviations computed by bootstrapping.
  • ...and 7 more figures
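
The comparison described in the Figure 3 caption (KL divergence between the perplexity distributions of memorized and non-memorized samples, with the non-memorized samples as the reference and bootstrapped confidence intervals) can be sketched as follows. This is a rough illustration, not the authors' code: the histogram binning, the smoothing constant `eps`, the choice to resample both groups, the percentile interval, and all function names are assumptions made for this example.

```python
import math
import random
from typing import List, Sequence, Tuple


def histogram(values: Sequence[float], edges: Sequence[float], eps: float = 1e-9) -> List[float]:
    """Normalized histogram over bins [edges[i], edges[i+1]); eps avoids log(0)."""
    n_bins = len(edges) - 1
    counts = [eps] * n_bins
    for v in values:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            # The final bin also includes its right edge.
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def bootstrap_kl(
    memorized: Sequence[float],
    reference: Sequence[float],
    edges: Sequence[float],
    n_boot: int = 200,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate of KL(memorized || reference) and a 95% percentile interval."""
    rng = random.Random(seed)
    point = kl_divergence(histogram(memorized, edges), histogram(reference, edges))
    samples = []
    for _ in range(n_boot):
        # Resample both groups with replacement (an assumption of this sketch).
        mem = rng.choices(memorized, k=len(memorized))
        ref = rng.choices(reference, k=len(reference))
        samples.append(kl_divergence(histogram(mem, edges), histogram(ref, edges)))
    samples.sort()
    lower = samples[int(0.025 * n_boot)]
    upper = samples[int(0.975 * n_boot) - 1]
    return point, lower, upper
```

In use, `memorized` and `reference` would hold per-sample perplexities for one duplicate-count bucket, and the function would be called once per bucket to trace how the divergence changes with duplication.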