Memorization: A Close Look at Books

Iris Ma, Ian Domingo, Alberto Krone-Martins, Pierre Baldi, Cristina V. Lopes

TL;DR

This paper investigates the extent to which entire books can be extracted from Llama 3 models using a prefix-prompting method. It compares pretrained, instruction-tuned, and fine-tuned variants on a dataset derived from Project Gutenberg, assessing both autoregressive and piece-wise reconstructions. The study finds that extraction rates correlate with book popularity, and thus likely with duplication in the training data; that instruction tuning strongly mitigates extraction; and that fine-tuning can partially undo this mitigation through changes to a small fraction of weights concentrated in the lower transformer blocks. The results provide a framework for evaluating memorization across model versions and training regimes, with implications for copyright, privacy, and alignment.

Abstract

To what extent can entire books be extracted from LLMs? Using the Llama 3 70B family of models and the "prefix-prompting" extraction technique, we were able to auto-regressively reconstruct, with a very high level of similarity, one entire book (Alice's Adventures in Wonderland) from just the first 500 tokens. We were also able to obtain high extraction rates on several other books, piece-wise. However, these successes do not extend uniformly to all books. We show that extraction rates of books correlate with book popularity and thus, likely, with duplication in the training data. We also confirm the undoing of mitigations in the instruction-tuned Llama 3.1, following recent work (Nasr et al., 2025). We further find that this undoing comes from changes to only a tiny fraction of weights, concentrated primarily in the lower transformer blocks. Our results provide evidence of the limits of current regurgitation mitigation strategies and introduce a framework for studying how fine-tuning affects the retrieval of verbatim memorization in aligned LLMs.
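
The difference between the auto-regressive and piece-wise settings mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the `generate` callable, the function names, and the 500-token chunk and context sizes are assumptions made for illustration, and the piece-wise variant assumes that each passage is prompted with the true preceding text, which the abstract does not spell out.

```python
from typing import Callable, List

# Hypothetical model interface (an assumption, not a real library API):
# given context tokens and a count n, return up to n continuation tokens.
GenerateFn = Callable[[List[int], int], List[int]]


def reconstruct_autoregressive(generate: GenerateFn, prefix: List[int],
                               total_len: int, chunk_len: int = 500,
                               context_len: int = 500) -> List[int]:
    """Grow a text from `prefix`, conditioning each step on the model's own output."""
    out = list(prefix)
    while len(out) < total_len:
        context = out[-context_len:]              # latest generated tokens
        want = min(chunk_len, total_len - len(out))
        new = generate(context, want)
        if not new:                               # model stopped early
            break
        out.extend(new[:want])
    return out


def reconstruct_piecewise(generate: GenerateFn, book: List[int],
                          chunk_len: int = 500,
                          context_len: int = 500) -> List[int]:
    """Prompt each passage with the true preceding tokens, so errors do not compound."""
    out = list(book[:context_len])
    for start in range(context_len, len(book), chunk_len):
        context = book[start - context_len:start]  # ground-truth prefix
        want = min(chunk_len, len(book) - start)
        out.extend(generate(context, want)[:want])
    return out


def toy_model(context: List[int], n: int) -> List[int]:
    """Stand-in for an LLM that has perfectly memorized the sequence 0, 1, 2, ..."""
    return [context[-1] + i + 1 for i in range(n)]


if __name__ == "__main__":
    book = list(range(2000))
    auto = reconstruct_autoregressive(toy_model, book[:500], len(book))
    piece = reconstruct_piecewise(toy_model, book)
    print(auto == book, piece == book)
```

With a real model the two settings diverge: in the auto-regressive loop a single wrong continuation contaminates every later context, whereas the piece-wise loop resets to the true text at each passage.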

Paper Structure

This paper contains 28 sections, 1 equation, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Median Jaccard similarity scores for books of varying popularity levels extracted from Llama 3.1 Instruct after SFT on 1,000 samples. Books from the pre-cutoff collection (pre) and post-cutoff collection (post) are indicated by blue and red markers, respectively.
  • Figure 2: Jaccard similarity across books for autoregressive generation. * denotes books added to Project Gutenberg after December 2023.
  • Figure 3: Median Jaccard similarity scores for passage-wise generation on the Llama 3.1 70B pretrained and Llama 3.1 70B Instruct models. * denotes books added to Project Gutenberg after December 2023.
  • Figure 4: Median Jaccard scores for fine-tuned models (pretrained and instruct) at different sample sizes. * denotes books added to Project Gutenberg after December 2023.
  • Figure 5: Log-log scale histogram of relative updates of individual weights in the entire network. The vast majority of updates are relatively small compared to the original Llama weights.
  • ...and 10 more figures
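
The two quantities named in these captions, Jaccard similarity and the relative update of a weight, can be made concrete with a short sketch. This is an illustration under stated assumptions rather than the paper's implementation: the listing does not say which units the similarity is computed over, so lowercase whitespace-separated words are assumed here, and the relative update is assumed to be |w_tuned - w_base| / |w_base|. All function names are hypothetical.

```python
from statistics import median
from typing import Iterable, List, Sequence, Tuple


def jaccard(reference: str, generated: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over sets of lowercase words (assumed unit)."""
    a, b = set(reference.lower().split()), set(generated.lower().split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def median_jaccard(pairs: Iterable[Tuple[str, str]]) -> float:
    """Median of per-passage scores for one book, as a per-book summary."""
    return median(jaccard(ref, gen) for ref, gen in pairs)


def relative_updates(base: Sequence[float], tuned: Sequence[float],
                     eps: float = 1e-12) -> List[float]:
    """Per-weight |tuned - base| / (|base| + eps); eps guards against zero weights."""
    return [abs(t - b) / (abs(b) + eps) for b, t in zip(base, tuned)]


if __name__ == "__main__":
    passages = [
        ("the cat sat", "the cat sat"),
        ("the cat sat", "the cat ran"),
        ("down the rabbit hole", "up the hill"),
    ]
    print(median_jaccard(passages))
    print(relative_updates([1.0, -2.0, 0.5], [1.1, -2.0, 0.25]))
```

A set-based score ignores word order and repetition, so it is more forgiving than an exact-match or edit-distance metric; that is a property of this sketch and should not be read as a claim about the paper's exact scoring.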