Retracing the Past: LLMs Emit Training Data When They Get Lost

Myeongseob Ko, Nikhil Reddy Billa, Adam Nguyen, Charles Fleming, Ming Jin, Ruoxi Jia

TL;DR

The paper addresses the leakage of memorized training data from LLMs by introducing Confusion-Inducing Attacks (CIA), which systematically maximize token-level prediction entropy to drive models into the high-uncertainty states that precede the emission of memorized data. For aligned models, it augments CIA with Mismatched Supervised Fine-Tuning (SFT) to weaken alignment, improving extraction rates without access to the training data. Across unaligned models (e.g., Llama 1/2) and aligned models (e.g., Llama 2-Chat, Llama 3-Instruct variants), CIA and CIA+SFT outperform prior baselines, reaching verbatim match rates (VM@50) of up to ~22% on unaligned models and up to ~6% on aligned models, with near-verbatim success of around 18% under relaxed tolerance. The work also provides a practical verification pipeline via InfiniGram and an ablation study showing how the entropy objectives and SFT mismatches drive leakage. Overall, the findings establish a more systematic framework for assessing memorization risks and highlight a concrete, measurable signal, the entropy spike, as a precursor to data regurgitation, with implications for privacy and copyright protection in LLM deployment.

Abstract

The memorization of training data in large language models (LLMs) poses significant privacy and copyright concerns. Existing data extraction methods, particularly heuristic-based divergence attacks, often exhibit limited success and offer limited insight into the fundamental drivers of memorization leakage. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data by systematically maximizing model uncertainty. We empirically demonstrate that the emission of memorized text during divergence is preceded by a sustained spike in token-level prediction entropy. CIA leverages this insight by optimizing input snippets to deliberately induce this consecutive high-entropy state. For aligned LLMs, we further propose Mismatched Supervised Fine-tuning (SFT) to simultaneously weaken their alignment and induce targeted confusion, thereby increasing susceptibility to our attacks. Experiments on various unaligned and aligned LLMs demonstrate that our proposed attacks outperform existing baselines in extracting verbatim and near-verbatim training data without requiring prior knowledge of the training data. Our findings highlight persistent memorization risks across various LLMs and offer a more systematic method for assessing these vulnerabilities.
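
The entropy signal described above can be illustrated with a small sketch. It computes the Shannon entropy (in bits) of each next-token distribution and measures the longest run of consecutive high-entropy positions. The function names, the toy distributions, and the threshold of 1.5 bits are assumptions made for illustration; this is not the paper's attack objective or its detection procedure.

```python
import math
from typing import Sequence


def token_entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy, in bits, of a single next-token distribution."""
    # Zero-probability tokens contribute nothing to the sum.
    return sum(-p * math.log2(p) for p in probs if p > 0.0)


def longest_high_entropy_run(entropies: Sequence[float], threshold: float) -> int:
    """Length of the longest run of consecutive positions above `threshold`."""
    best = current = 0
    for h in entropies:
        current = current + 1 if h > threshold else 0
        best = max(best, current)
    return best


# Toy next-token distributions over a 4-token vocabulary (hypothetical data).
steps = [
    [1.0, 0.0, 0.0, 0.0],      # fully confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain prediction
    [0.25, 0.25, 0.25, 0.25],
    [0.5, 0.5, 0.0, 0.0],      # moderately uncertain prediction
]
entropies = [token_entropy_bits(p) for p in steps]
run_length = longest_high_entropy_run(entropies, threshold=1.5)  # threshold is an assumption
print(entropies, run_length)
```

With a real model, the per-step distributions would come from a softmax over the model's logits at each generated position; the run length is one simple way to quantify a "sustained" spike as opposed to an isolated one.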

Paper Structure

This paper contains 38 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Conceptual illustration of our Confusion-Inducing Attacks (CIA) compared to heuristic approaches. While heuristic prompts (e.g., "Repeat 'Debug' 50 times", bottom path) often lead to divergence and rarely reveal memorized text, our CIA with optimized tokens like "Aires casa..." deliberately steers the LLM into a high entropy state. This induced uncertain state increases the likelihood of the model revealing memorized training data.
  • Figure 2: Token-wise entropy (bits) for Llama 2 (70B) responses to repetition-based divergence prompts (Nasr et al., 2025). Panels show (Left) simple repetition, (Middle) non-meaningful divergence, and (Right) verbatim memorization ("The Lord is my shepherd…"). We observe a sustained high-entropy spike preceding memorized text emission in the right panel, which distinguishes it from other behaviors.
  • Figure 3: Response token diversity of different attack methods across varying filtering thresholds. The y-axis shows token diversity (unique tokens / total tokens in generated output, as defined by the paper's diversity score equation), while the x-axis indicates the diversity threshold.
  • Figure 4: Distribution of semantic similarity scores for matched sequences under two tolerance settings: M5@50 and M10@50. While both settings yield high semantic overlap with training data, M5@50 shows consistently higher fidelity with low variance, supporting its utility in identifying near-verbatim memorization.
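
The token diversity measure in the Figure 3 caption (unique tokens / total tokens) can be sketched as follows. Whitespace splitting stands in for a real tokenizer, and the rule of keeping responses at or above the threshold is an assumption about how the filter is applied; the function names are hypothetical.

```python
from typing import List, Sequence


def diversity_score(tokens: Sequence[str]) -> float:
    """Fraction of unique tokens in a generated output (0.0 for empty output)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def filter_by_diversity(
    responses: List[List[str]], threshold: float
) -> List[List[str]]:
    """Keep responses whose diversity is at least `threshold` (assumed rule)."""
    return [r for r in responses if diversity_score(r) >= threshold]


# Whitespace splitting is a stand-in for the model's tokenizer.
repetitive = "Debug Debug Debug Debug".split()
varied = "The Lord is my shepherd".split()

print(diversity_score(repetitive))
print(diversity_score(varied))
kept = filter_by_diversity([repetitive, varied], threshold=0.5)
print(len(kept))
```

A low score flags degenerate repetition, so raising the threshold trades recall for outputs that are more likely to contain meaningful text.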

Theorems & Definitions (1)

  • Definition 2.1: Extractable Memorization