RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline

André V. Duarte, Xuying Li, Bin Zeng, Arlindo L. Oliveira, Lei Li, Zhuo Li

TL;DR

The paper tackles the challenge of proving what text a large language model memorizes from its training data, particularly with copyrighted material. It introduces RECAP, an agentic pipeline that iteratively extracts memorized passages via Section Summary, Extraction, Verbatim Verification, Jailbreaking, and Feedback modules, augmented by a Memorization Score Filtering mechanism. The authors validate RECAP on the EchoTrace benchmark (over 70,000 40-token passages from 35 books and 20 arXiv papers), showing substantial gains in verbatim extraction for copyrighted and public-domain texts while largely avoiding contamination of non-training data. They provide extensive analyses across model sizes, content popularity, and iteration cost, and discuss practical and ethical considerations for deployment and reproducibility. Overall, RECAP demonstrates a robust, though resource-intensive, approach to evidencing model memorization with implications for model auditing and alignment.

Abstract

If we cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model itself freely reproduces the target content. As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs. At the heart of RECAP is a feedback-driven loop, where an initial extraction attempt is evaluated by a secondary language model, which compares the output against a reference passage and identifies discrepancies. These are then translated into minimal correction hints, which are fed back into the target model to guide subsequent generations. In addition, to address alignment-induced refusals, RECAP includes a jailbreaking module that detects and overcomes such barriers. We evaluate RECAP on EchoTrace, a new benchmark spanning over 30 full books, and the results show that RECAP leads to substantial gains over single-iteration approaches. For instance, with GPT-4.1, the average ROUGE-L score for copyrighted text extraction improved from 0.38 to 0.47, an increase of nearly 24%.
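To make the reported metric concrete, here is a minimal sketch of an LCS-based ROUGE-L F1 score and of the relative gain quoted in the abstract. The whitespace tokenization, the F1 variant, and the function names `lcs_length`, `rouge_l_f1`, and `relative_gain` are assumptions made for illustration; the authors' evaluation may use a different tokenizer or ROUGE-L variant.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`."""
    return (after - before) / before


if __name__ == "__main__":
    print(rouge_l_f1("a b c d", "a x c y"))
    # Abstract figures: 0.38 -> 0.47, roughly 0.237, i.e. "nearly 24%".
    print(round(relative_gain(0.38, 0.47), 3))
```

The relative gain is (0.47 - 0.38) / 0.38, which is where the "nearly 24%" figure comes from.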

Paper Structure

This paper contains 54 sections, 2 equations, 27 figures, 18 tables.

Figures (27)

  • Figure 2: RECAP consists of a five-step pipeline. After the target content is selected, the Section Summary Agent segments it into semantically distinct events and generates high-level summaries that act as dynamic soft prompts. The Extraction Agent then attempts to reproduce verbatim passages for each event, and the Verbatim Verifier classifies each output as accepted or refused. Refusals trigger the Jailbreaker, which rephrases prompts to overcome alignment safeguards, while accepted outputs are analyzed by the Feedback Agent, which provides structured correction hints for reattempts. This extraction-feedback loop is repeated up to five times.
  • Figure 3: The Parrot BERT is trained to intensely learn the target book, enabling it to capture memorization signals used in our hybrid score.
  • Figure 4: Larger GPT-4.1 models exhibit higher extractability of memorized content, with RECAP achieving the greatest gains in ROUGE-L.
  • Figure 5: Among the copyrighted books, titles with higher sales tend to achieve higher ROUGE-L scores with RECAP.
  • Figure 6: Most improvements are achieved during the first feedback iteration, with fewer than 20% of the events benefiting from further rounds. Results are for DeepSeek-V3 on all EchoTrace books (excluding the Non-Training group).
  • ...and 22 more figures
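
The control flow described in the Figure 2 caption can be sketched as a short loop. This is only an illustration under stated assumptions, not the authors' implementation: `recap_loop` and all of its callable parameters are hypothetical names, the agents are passed in as plain functions, and two choices are guesses (a refusal consumes one of the five iterations, and the best-scoring output is kept).

```python
from typing import Callable, Tuple

MAX_ITERATIONS = 5  # Figure 2 caption: the loop is repeated up to five times


def recap_loop(
    summary: str,
    reference: str,
    extract: Callable[[str], str],         # hypothetical Extraction Agent
    is_refusal: Callable[[str], bool],     # hypothetical Verbatim Verifier
    jailbreak: Callable[[str], str],       # hypothetical Jailbreaker
    feedback: Callable[[str, str], str],   # hypothetical Feedback Agent
    score: Callable[[str, str], float],    # e.g. a ROUGE-L style similarity
) -> Tuple[str, float]:
    """Return the best extraction found and its score."""
    base_prompt = summary  # the section summary acts as the soft prompt
    hint = ""
    best, best_score = "", -1.0
    for _ in range(MAX_ITERATIONS):
        prompt = base_prompt if not hint else f"{base_prompt}\nHint: {hint}"
        output = extract(prompt)
        if is_refusal(output):
            # Assumption: a refusal uses up one iteration and only the
            # base prompt is rephrased before retrying.
            base_prompt = jailbreak(base_prompt)
            continue
        current = score(output, reference)
        if current > best_score:
            best, best_score = output, current
        if current >= 1.0:
            break  # verbatim match, nothing left to correct
        # Accepted but imperfect: ask for a correction hint and retry.
        hint = feedback(output, reference)
    return best, best_score
```

Passing the agents in as callables keeps the sketch independent of any particular model API; in practice each one would wrap a call to the target or the secondary language model.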