Measuring Non-Adversarial Reproduction of Training Data in Large Language Models

Michael Aerni, Javier Rando, Edoardo Debenedetti, Nicholas Carlini, Daphne Ippolito, Florian Tramèr

TL;DR

This work investigates non-adversarial reproduction, an intermediate regime of memorization, by quantifying the overlap between model responses and pretraining data when models respond to natural and benign prompts.

Abstract

Large language models memorize parts of their training data. Memorizing short snippets and facts is required to answer questions about the world and to be fluent in any language. But models have also been shown to reproduce long verbatim sequences of memorized text when prompted by a motivated adversary. In this work, we investigate an intermediate regime of memorization that we call non-adversarial reproduction, where we quantify the overlap between model responses and pretraining data when responding to natural and benign prompts. For a variety of innocuous prompt categories (e.g., writing a letter or a tutorial), we show that up to 15% of the text output by popular conversational language models overlaps with snippets from the Internet. In worst cases, we find generations where 100% of the content can be found exactly online. For the same tasks, we find that human-written text has far less overlap with Internet data. We further study whether prompting strategies can close this reproduction gap between models and humans. While appropriate prompting can reduce non-adversarial reproduction on average, we find that mitigating worst-case reproduction of training data requires stronger defenses -- even for benign interactions.

Paper Structure

This paper contains 58 sections, 2 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: LLMs often output text that overlaps with snippets of their training data when responding to benign prompts. Red text indicates snippets that were found verbatim on the Web.
  • Figure 2: LLMs reproduce training data for natural prompts. We define reproduced strings as text found verbatim on the Internet. For every LLM generation, we measure the overlap rate, that is, the fraction of text contained in a reproduced substring of at least $50$ characters. We find non-trivial overlap rates for both our broad set of controlled prompts (a) and real-world interactions (b). Additional models are in the appendix.
  • Figure 3: Non-adversarial reproduction is long-tailed. We calculate the number of generated texts that have a minimum reproduced substring length (left) and a minimum overlap rate (right). The overlap rate is the fraction of text contained in a reproduced substring of at least $50$ characters. We combine generations from all models and distinguish between text types. This reveals that non-adversarial reproduction is long-tailed, with few generations containing high overlap rates and very long reproduced strings.
  • Figure 4: Expository writing tasks elicit more reproduction than creative writing. We compare the overlap rate (fraction of text contained in a $50$-character string on the Internet) across text types and tasks. The amount of non-adversarial reproduction consistently differs between text types, but even more so between individual tasks. We report the balanced mean over tasks in (a) and the statistics over all models together in (b).
  • Figure 5: LLMs emit longer sequences of existing text than humans. We report the percentage of texts that contain a minimum-length reproduction of text on the Internet. We compare human texts to the minimum and maximum percentage over all LLMs at every length. LLMs consistently reproduce longer sequences than humans across all text types. We attribute the long human tail in (b) to blatant plagiarism (see the qualitative analysis subsection).
  • ...and 6 more figures
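
The overlap rate used throughout the captions above (the fraction of text contained in a reproduced substring of at least $50$ characters) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `overlap_rate`, the parameter `min_len`, and the small in-memory `corpus` list are assumptions made for illustration, with the list standing in for a lookup against Internet-scale data.

```python
from typing import Iterable


def overlap_rate(text: str, corpus: Iterable[str], min_len: int = 50) -> float:
    """Fraction of characters in `text` that lie inside a substring of at
    least `min_len` characters found verbatim in some corpus document.

    Hypothetical helper: `corpus` is a toy stand-in for an Internet index.
    """
    docs = list(corpus)
    n = len(text)
    if n == 0:
        return 0.0

    covered = [False] * n
    # Any reproduced substring of length >= min_len is a union of
    # overlapping min_len-character windows that each occur verbatim,
    # so checking every window of exactly min_len characters suffices.
    for start in range(n - min_len + 1):
        window = text[start:start + min_len]
        if any(window in doc for doc in docs):
            for i in range(start, start + min_len):
                covered[i] = True

    return sum(covered) / n


# Toy example: 62 reproduced characters followed by 38 novel ones.
reproduced = "abcdefghijklmnopqrstuvwxyz" * 2 + "0123456789"
text = reproduced + "#" * 38
corpus = ["prefix " + reproduced + " suffix"]
print(overlap_rate(text, corpus))  # 0.62
```

The brute-force substring scan is only practical for tiny corpora; at scale, some indexed lookup would have to replace the `window in doc` check.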