Extracting Memorized Training Data via Decomposition

Ellen Su, Anu Vellore, Amy Chang, Raffaele Mura, Blaine Nelson, Paul Kassianik, Amin Karbasi

TL;DR

This paper investigates whether frontier LLMs memorize and reveal training data. It introduces a query-based decomposition method that incrementally reconstructs memorized text by breaking prompts into smaller pieces and progressively requesting sentences. Through experiments on NYT and WSJ corpora with two commercial frontier LLMs, the method retrieves verbatim sentences from dozens of articles, illustrating a data-leak risk. The findings underscore privacy and security implications and motivate governance, safeguards, and further research into robust defenses against data-extraction attacks.

Abstract

The widespread use of Large Language Models (LLMs) in society creates new information security challenges for developers, organizations, and end-users alike. LLMs are trained on large volumes of data, and their susceptibility to revealing the exact contents of the source training datasets poses security and safety risks. Although current alignment procedures restrict common risky behaviors, they do not completely prevent LLMs from leaking data. Prior work demonstrated that LLMs may be tricked into divulging training data by using out-of-distribution queries or adversarial techniques. In this paper, we demonstrate a simple, query-based decompositional method to extract news articles from two frontier LLMs. We use instruction decomposition techniques to incrementally extract fragments of training data. Out of 3723 New York Times articles, we extract at least one verbatim sentence from 73 articles, and over 20% of the sentences verbatim from 6 articles. Our analysis demonstrates that this method successfully induces the LLM to generate texts that are reliable reproductions of news articles, meaning that they likely originate from the source training dataset. This method is simple, generalizable, and does not fine-tune or change the production model. If replicable at scale, this training data extraction methodology could expose new LLM security and safety vulnerabilities, including privacy risks and unauthorized data leaks. These implications require careful consideration from model development through end use.
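
To make the reported counts concrete: 73 of 3723 articles is roughly 2% of the corpus, and 6 of 3723 is under 0.2%. The sketch below illustrates one simple way a per-article "fraction of sentences reproduced verbatim" could be computed. It is an assumption-laden illustration, not the paper's EMP or BITAP-positional metric: the function names, the naive regex sentence splitter, and the whitespace normalization are all hypothetical choices made here.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not hide a match."""
    return " ".join(text.split())


def verbatim_fraction(reference: str, generated: str) -> float:
    """Fraction of reference sentences that appear verbatim in the generated text.

    Hypothetical helper for illustration; the paper's own metrics may differ.
    """
    ref_sentences = split_sentences(reference)
    if not ref_sentences:
        return 0.0
    haystack = normalize(generated)
    hits = sum(1 for s in ref_sentences if normalize(s) in haystack)
    return hits / len(ref_sentences)


if __name__ == "__main__":
    reference = "The cat sat. The dog ran! Birds fly? Fish swim. Ants march."
    generated = "Intro. The dog   ran! Something else."
    # Compare the score against a threshold such as the 20% mentioned in the abstract.
    print(verbatim_fraction(reference, generated))
```

Exact substring matching is deliberately strict: a single changed word makes a sentence count as not reproduced, so this sketch gives a lower bound on overlap rather than a measure of near-verbatim similarity.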

Paper Structure

This paper contains 45 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Reference article (left), our LLM prompting flow to extract training data (middle), and our results (right).
  • Figure 2: Example of a SIMPLE run on LLM-
  • Figure 3: SIMPLE System Prompt for LLM-.
  • Figure 4: Distribution of EMP (blue) and BITAP-positional (green) values on articles from the final round. The red line represents the 20% threshold. Observe the skewed distribution.
  • Figure 5: Distribution of the BITAP-positional (no quotes) metric on LLM- and LLM- for the NYT dataset. The green circles represent article scores on the BITAP-positional (no quotes) metric. Observe the strong imbalance of the distribution. The red points are articles that contain generic quotes that the model tends to generate when prompted on the topic of the article rather than the article itself.