Estimating near-verbatim extraction risk in language models with decoding-constrained beam search

A. Feder Cooper, Mark A. Lemley, Christopher De Sa, Lea Duesterwald, Allison Casasola, Jamie Hayes, Katherine Lee, Daniel E. Ho, Percy Liang

Abstract

Recent work shows that standard greedy-decoding extraction methods for quantifying memorization in LLMs miss how extraction risk varies across sequences. Probabilistic extraction -- computing the probability of generating a target suffix given a prefix under a decoding scheme -- addresses this, but is tractable only for verbatim memorization, missing near-verbatim instances that pose similar privacy and copyright risks. Quantifying near-verbatim extraction risk is expensive: the set of near-verbatim suffixes is combinatorially large, and reliable Monte Carlo (MC) estimation can require ~100,000 samples per sequence. To mitigate this cost, we introduce decoding-constrained beam search, which yields deterministic lower bounds on near-verbatim extraction risk at a cost comparable to ~20 MC samples per sequence. Across experiments, our approach surfaces information invisible to verbatim methods: many more extractable sequences, substantially larger per-sequence extraction mass, and patterns in how near-verbatim extraction risk manifests across model sizes and types of text.
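As a concrete illustration of the probabilistic-extraction quantity described in the abstract, the sketch below computes the probability of generating a target suffix from a prefix under top-$k$ decoding by renormalizing the model's next-token distribution over the top $k$ tokens at each step. It is a minimal sketch, not the paper's implementation: the model ("gpt2"), the example sentence, and the prefix/suffix split are illustrative stand-ins.

```python
# Minimal sketch: probability of a target suffix under top-k decoding.
# The model name ("gpt2"), example text, and prefix length are illustrative.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def suffix_probability_topk(model, prefix_ids, suffix_ids, k=40):
    """Pr_{theta,k}(suffix | prefix): product of per-step probabilities of the
    target tokens after restricting and renormalizing to the top-k tokens.
    Returns 0 if any target token falls outside the top-k at its step."""
    context = prefix_ids.clone()
    log_prob = 0.0
    for target in suffix_ids.tolist():
        with torch.no_grad():
            logits = model(context.unsqueeze(0)).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        topk_probs, topk_ids = probs.topk(k)
        if target not in topk_ids.tolist():
            return 0.0  # the suffix cannot be produced under top-k decoding
        log_prob += math.log((probs[target] / topk_probs.sum()).item())
        context = torch.cat([context, torch.tensor([target])])
    return math.exp(log_prob)

tok = AutoTokenizer.from_pretrained("gpt2")            # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
ids = tok("In my younger and more vulnerable years my father gave me some advice "
          "that I have been turning over in my mind ever since.",
          return_tensors="pt").input_ids[0]
prefix, suffix = ids[:10], ids[10:]                    # illustrative prefix/suffix split
print(suffix_probability_topk(model, prefix, suffix, k=40))
```

A verbatim extraction claim of the kind illustrated in Figure 1 then amounts to checking whether this probability meets a chosen threshold $\tau_{\min}$.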

Paper Structure

This paper contains 183 sections, 20 theorems, 145 equations, 29 figures, 15 tables, 5 algorithms.

Key Result

Theorem 1

Fix a token vocabulary ${\mathbb{V}}$ and $T\in{\mathbb{N}}$. For all ${\bm{b}},{\bm{c}}\in {\mathbb{V}}^T$, … Further: …

Figures (29)

  • Figure 1: Probabilistic extraction. For $\theta\!=$ Llama 1 13B and a training sequence ${\bm{z}}$ from The Great Gatsby, we show prefix ${\bm{z}}_{\textnormal{(pre)}}\coloneqq{\bm{z}}_{1:a}$ and $3$ continuations $\hat{{\bm{z}}}_{\textnormal{(cont)}} \coloneqq\hat{{\bm{z}}}_{a+1:a+T}$ under $\phi\!=$ top-$k\!=\!40$ (Equation \ref{eq:topk:main}) with conditional probabilities $\Pr_{\theta, k}(\hat{{\bm{z}}}_{\textnormal{(cont)}} \mid {\bm{z}}_{\textnormal{(pre)}})$ (Equation \ref{eq:pz:main}). We diff each $\hat{{\bm{z}}}_{\textnormal{(cont)}}$ with the target suffix ${\bm{z}}_{\textnormal{(suf)}}\coloneqq{\bm{z}}_{a+1:a+T}$ (character space: blue additions, red deletions) and quantify the Levenshtein distance (token space, Equation \ref{eq:dist:main}). We highlight verbatim extraction (i.e., $\hat{{\bm{z}}}_{\textnormal{(cont)}}\!=\!{\bm{z}}_{\textnormal{(suf)}}$, $0.1431\geq\tau_{\min}=0.001$), which is not the greedy continuation (top row). All three $\hat{{\bm{z}}}_{\textnormal{(cont)}}$ are near-verbatim matches to ${\bm{z}}_{\textnormal{(suf)}}$ (Section \ref{sec:warmup}).
  • Figure 2: Monte Carlo (MC) estimation. For Levenshtein distance $\leq\!5$ ($\hat{p}^{\mathsf{Lev}}_{{\bm{z}},\, 5}$), we plot convergence for a single sequence ${\bm{z}}$ from The Great Gatsby for Llama 2 7B, showing the pooled MC estimate with a 95% confidence interval over 3 replicates. Our algorithm ($k$-CBS, Section \ref{sec:kcbs}) produces a deterministic, provably correct lower bound (LB) of $\approx\!0.01$. It captures $89.4\%$ of the mean MC estimate at $M\!=\!10^4$ samples, at a cost of $\approx\!20$ MC samples (Appendix \ref{app:sec:intuition:cost})---a budget at which MC produces no hits. (A simplified sketch of the MC baseline appears after this figure list.)
  • Figure 3: Comparing extraction rates. For OLMo 2 7B, 13B, and 32B, we show rates for verbatim ($\varepsilon\!=\!0$) and near-verbatim extraction for $\mathsf{Lev}\, \varepsilon \in \{1,\ldots,5\}$. For greedy near-verbatim, one generates the single greedy $\hat{{\bm{z}}}_{\textnormal{(cont)}}$ and checks whether $\mathsf{Lev}(\hat{{\bm{z}}}_{\textnormal{(cont)}},{\bm{z}}_{\textnormal{(suf)}}) \leq \varepsilon$. We use a sample of $10{,}000$ Wikipedia sequences from OLMo 2's training data; to assess validity, we also run analogous negative controls on $5{,}000$ held-out sequences scraped from Wikipedia that post-date OLMo 2's training cutoff. Greedy rates are exact. Probabilistic rates are computed with $k$-CBS (Section \ref{sec:kcbs:baseline}); they may miss some valid instances of extraction, and thus should be interpreted as lower bounds on extraction rates.
  • Figure 4: Near-verbatim mass vs. verbatim mass. Llama 2 on The Great Gatsby; each point is one sequence. Axes show near-verbatim ($p_{{\bm{z}},5}^\mathsf{Lev}$, $\mathsf{Lev}\,\varepsilon\!=\!5$) vs. verbatim ($p_{\bm{z}}$) extraction mass on a $\log$--$\log$ scale. Red/orange points are "unlocked" by near-verbatim extraction (to the left of the $\tau_{\min}$ dotted reference line, $p_{\bm{z}}\!<\!\tau_{\min}$, but $p_{{\bm{z}},5}^\mathsf{Lev}\!\geq\!\tau_{\min}$); blue points are verbatim-extractable ($p_{\bm{z}}\!\geq\!\tau_{\min}$). Points above the dashed $y\!=\!x$ line show increased extraction risk when near-verbatim mass is accounted for.
  • Figure 5: CCDF of per-sequence near-verbatim mass gain. For $\mathsf{Lev}\, \varepsilon\!=\!5$ mass minus verbatim mass ($\hat{p}_{{\bm{z}},5}^{\mathsf{Lev}} - p_{\bm{z}}$), a point $(x, y)$ means $y\%$ of sequences have extraction-mass gain $\geq x$.
  • ...and 24 more figures
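To make the Monte Carlo baseline referenced in Figure 2 concrete, the sketch below samples continuations under top-$k$ decoding and estimates the near-verbatim mass as the fraction of samples within token-level Levenshtein distance $\varepsilon$ of the target suffix. It is a hedged sketch of the MC baseline only, not the paper's $k$-CBS algorithm, and it reuses the illustrative model, prefix, and suffix from the earlier snippet.

```python
# Sketch of a Monte Carlo estimate of near-verbatim extraction mass.
# Assumes `model`, `prefix`, and `suffix` from the previous (illustrative) snippet.
import torch

def levenshtein(a: list[int], b: list[int]) -> int:
    """Token-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution / match
        prev = curr
    return prev[-1]

def mc_near_verbatim_mass(model, prefix_ids, suffix_ids, eps=5, k=40, num_samples=1000):
    """Fraction of top-k sampled continuations within Levenshtein distance eps
    of the target suffix -- a simple MC estimate of the near-verbatim mass."""
    with torch.no_grad():
        out = model.generate(prefix_ids.unsqueeze(0),
                             do_sample=True, top_k=k,
                             max_new_tokens=len(suffix_ids),
                             num_return_sequences=num_samples,
                             pad_token_id=model.config.eos_token_id)
    hits = sum(levenshtein(seq[len(prefix_ids):].tolist(), suffix_ids.tolist()) <= eps
               for seq in out)
    return hits / num_samples

# e.g. mc_near_verbatim_mass(model, prefix, suffix, eps=5, k=40, num_samples=1000)
```

As the abstract notes, reliable estimates of this kind can require on the order of $10^5$ samples per sequence, which is the cost that decoding-constrained beam search is designed to avoid.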

Theorems & Definitions (41)

  • Theorem 1
  • Proof
  • Lemma 2: Low-probability budget from a sequence-mass floor $\tau$
  • Proof
  • Corollary 3: Rank-bucket consequences
  • Proof
  • Lemma 4: Heavy-mass path survival under across-beam top-$B$
  • Proof
  • Lemma 5: No token-level duplicates
  • Proof
  • ...and 31 more