Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, Tuhin Chakrabarty

Abstract

Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami's novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors' works reactivates latent memorization from pretraining. Three models from different providers memorize the same books in the same regions ($r \ge 0.90$), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors' works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.
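The abstract reports two kinds of quantities: the length of the longest single verbatim span and the share of a held-out book that is reproduced. This section does not define the paper's exact metrics, so the following is only a minimal sketch of how such quantities could be computed at the word level. The function names (`tokenize`, `longest_verbatim_span`, `verbatim_coverage`), the lowercase word tokenization, and the 5-word matching threshold `k` are all illustrative assumptions, not the authors' implementation.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase word tokenization (an assumption; the paper may tokenize differently)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def longest_verbatim_span(generated, reference):
    """Length, in words, of the longest contiguous word sequence shared by both texts."""
    gen, ref = tokenize(generated), tokenize(reference)
    # autojunk=False prevents difflib from discarding frequent words on long inputs.
    matcher = SequenceMatcher(None, gen, ref, autojunk=False)
    match = matcher.find_longest_match(0, len(gen), 0, len(ref))
    return match.size


def verbatim_coverage(generated, reference, k=5):
    """Fraction of reference words lying inside some k-word run that also occurs
    verbatim in the generated text. The threshold k is a hypothetical choice."""
    gen, ref = tokenize(generated), tokenize(reference)
    if len(ref) < k:
        return 0.0
    # All k-word windows of the generated text, for constant-time lookup.
    gen_kgrams = {tuple(gen[i:i + k]) for i in range(len(gen) - k + 1)}
    covered = [False] * len(ref)
    for i in range(len(ref) - k + 1):
        if tuple(ref[i:i + k]) in gen_kgrams:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(ref)
```

Under this sketch, a claim such as "single verbatim spans exceeding 460 words" would correspond to `longest_verbatim_span` returning a value above 460 for a model output and the matching book passage, and a coverage figure would be read off `verbatim_coverage` computed against the held-out book text.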

Paper Structure (36 sections, 11 figures, 4 tables, 2 algorithms)

Figures (11)

  • Figure 1: Finetuning increases verbatim extraction of copyrighted books. Results for Sapiens and The Handmaid's Tale illustrate the effect as finetuned models show large gains over the aligned baseline on all four memorization metrics. Values above bars denote absolute increases.
  • Figure 2: Overview of the extraction pipeline. We generate plot summaries from book excerpts (left), finetune the model to expand summaries into verbatim text (center), and evaluate memorization on held-out books at inference (right).
  • Figure 3: Memorization results for within-author (a) and cross-author (b) settings. In (a), models are finetuned and tested on books by the same author. In (b), models are finetuned on Haruki Murakami's works and tested on unseen authors. For some books Gemini-2.5-Pro numbers are relatively lower because of output filters blocking regurgitation. Complete results are in Tables \ref{tab:in-domain-full} and \ref{tab:cross-domain-full}.
  • Figure 4: Memorization results with five random training-test author pairs. For each test book, we compare models finetuned on a randomly selected training author (top row) against models finetuned on Murakami (bottom row).
  • Figure 5: Pretraining overlap, not task format, drives extraction. Finetuning on Virginia Woolf's public domain novels matches the cross-author condition, while synthetic stories yield minimal extraction. All conditions evaluated on The Handmaid's Tale.
  • ...and 6 more figures