Measuring memorization in language models via probabilistic extraction

Jamie Hayes, Marika Swanberg, Harsh Chaudhari, Itay Yona, Ilia Shumailov, Milad Nasr, Christopher A. Choquette-Choo, Katherine Lee, A. Feder Cooper

TL;DR

The paper argues that traditional one-shot, greedy discoverable extraction underestimates memorization risk in LLMs. It proposes a probabilistic framework, $(n,p)$-discoverable extraction, which quantifies the likelihood of extracting a target sequence at least once over $n$ independent queries under a chosen sampling scheme. With per-query success probability $p_z$, the relation $1-(1-p_z)^n \ge p$ ties $n$ and $p$ together. The framework extends to non-verbatim targets via $(\epsilon,n,p)$-discoverable extraction. Experiments across model sizes and data repetitions show that this probabilistic measure reveals higher and more nuanced extraction risk than greedy methods, with no extra computational overhead. Extraction rates for training data can be substantially higher than those for unseen test data, supporting the interpretation that the metric captures memorization in practice. Overall, the approach offers a reliable, scalable, and flexible tool for assessing training-data leakage risk in LLMs and can inform safer release and usage policies.
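The relation $1-(1-p_z)^n \ge p$ can be turned into a small calculator. The following is a minimal sketch, not the authors' code: the function names `extraction_probability` and `min_queries` are hypothetical, and the per-query probability $p_z$ is assumed to be given rather than computed from a model.

```python
import math


def extraction_probability(p_z: float, n: int) -> float:
    """Probability that at least one of n independent queries succeeds."""
    return 1.0 - (1.0 - p_z) ** n


def min_queries(p_z: float, p: float) -> int:
    """Smallest n with 1 - (1 - p_z)**n >= p (hypothetical helper).

    Assumes 0 < p_z <= 1 and 0 < p < 1.
    """
    if not (0.0 < p_z <= 1.0 and 0.0 < p < 1.0):
        raise ValueError("need 0 < p_z <= 1 and 0 < p < 1")
    if p_z == 1.0:
        return 1
    # Closed form: n >= log(1 - p) / log(1 - p_z).
    n = max(1, math.ceil(math.log(1.0 - p) / math.log(1.0 - p_z)))
    # Guard against floating-point rounding near the boundary.
    while n > 1 and extraction_probability(p_z, n - 1) >= p:
        n -= 1
    while extraction_probability(p_z, n) < p:
        n += 1
    return n


if __name__ == "__main__":
    print(min_queries(0.5, 0.9))    # 4
    print(min_queries(0.162, 0.9))  # 14
```

For example, if the $16.2\%$ quoted in the Figure 1 caption is treated as $p_z$ (an assumption made here for illustration), about 14 queries would be needed to reach $p = 0.9$, whereas a single greedy query fails outright on that example.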

Abstract

Large language models (LLMs) are susceptible to memorizing training data, raising concerns about the potential extraction of sensitive information at generation time. Discoverable extraction is the most common method for measuring this issue: split a training example into a prefix and suffix, then prompt the LLM with the prefix, and deem the example extractable if the LLM generates the matching suffix using greedy sampling. This definition yields a yes-or-no determination of whether extraction was successful with respect to a single query. Though efficient to compute, we show that this definition is unreliable because it does not account for non-determinism present in more realistic (non-greedy) sampling schemes, for which LLMs produce a range of outputs for the same prompt. We introduce probabilistic discoverable extraction, which, without additional cost, relaxes discoverable extraction by considering multiple queries to quantify the probability of extracting a target sequence. We evaluate our probabilistic measure across different models, sampling schemes, and training-data repetitions, and find that this measure provides more nuanced information about extraction risk compared to traditional discoverable extraction.

Paper Structure

This paper contains 52 sections, 10 equations, 17 figures.

Figures (17)

  • Figure 1: Left: The prefix ${\bm{z}}^t_{1:50}$, and portions of the greedy-sampled suffix and of example top-$k$-sampled suffixes for Pythia 12B. Blue indicates a match with the target, red a mismatch. Right: For each successive token that is decoded by greedy and top-$k$ ($k=40$, $T=1$) sampling, we plot the probability rank with respect to the target suffix token. At index $87$, the target token has rank $2$; greedy sampling does not select this token, after which the greedy-generated sequence diverges from the target. In contrast, top-$k$ sampling picks the rank-$2$ token and proceeds to extract the target sequence correctly (with probability $16.2\%$). Note that, if greedy sampling had selected the rank-$2$ token at index $87$, then it would have generated the target, as the remaining target tokens all have rank $1$.
  • Figure 2: For $250$ examples in the Pile (Wikipedia subset) and Pythia 6.9B, we check that generating $n=1000$ sequences and computing the probability a training example appears at least once in the set (empirical $p$) matches the theoretical $p$ using Equation (\ref{eq:npmem}).
  • Figure 3: For $10{,}000$ examples from the Enron dataset, we plot variations in $(n, p)$-discoverable extraction rates for models of different sizes, according to different query budgets $n$ and minimum extraction probability $p$.
  • Figure 4: Maximum extraction (Enron, Pythia 2.8B).
  • Figure 5: Comparing extraction rates across two different models, GPT-Neo 1.3B and Pythia 1B, using Enron.
  • ...and 12 more figures
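
The check described in the Figure 2 caption (empirical $p$ from repeated sampling versus theoretical $p$) can be mimicked without a model. The sketch below is an illustration under strong assumptions: each query is replaced by an independent Bernoulli draw with a fixed success probability `p_z`, and the names `theoretical_p` and `empirical_p` are hypothetical. It does not reproduce the paper's experiment.

```python
import random


def theoretical_p(p_z: float, n: int) -> float:
    """Probability of at least one success in n independent queries."""
    return 1.0 - (1.0 - p_z) ** n


def empirical_p(p_z: float, n: int, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which at least one of n simulated queries succeeds.

    Each query is a Bernoulli(p_z) draw standing in for sampling a suffix
    from a model and comparing it to the target (an assumption).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One trial = up to n queries; stop at the first simulated match.
        if any(rng.random() < p_z for _ in range(n)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    p_z, n = 0.002, 1000  # illustrative values, not taken from the paper
    print(theoretical_p(p_z, n), empirical_p(p_z, n, trials=2000))
```

With enough trials the two numbers should agree closely, which is the kind of agreement the Figure 2 caption reports for real model samples.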

Theorems & Definitions (3)

  • Definition 2.1: Discoverable extraction
  • Definition 3.1: $(n, p)$-discoverable extraction
  • Definition 3.2: $(\epsilon, n, p)$-discoverable extraction