Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

Trung Cuong Dang, David Mohaisen

TL;DR

The paper tackles verbatim memorization in large language models by proposing a multi-prefix memorization framework that assesses memory depth via the diversity of retrieval paths. A sequence counts as memorized if an external search identifies at least $P^s_{f_\theta}$ distinct prefixes that elicit it, where $P^s_{f_\theta}$ depends on a memorization score $\eta^s_{f_\theta}$. The methodology combines an internal memorization signal with an external adversarial-prefix search (Greedy Coordinate Gradient, GCG) to robustly distinguish memorized from non-memorized data, and it controls cost via early stopping. Experiments across model scales, data domains, and alignment regimes show that memorized content is more susceptible to elicitation, with prefix diversity revealing the depth of memorization and a lookup-table-like retrieval pattern rather than semantic prompting. The framework provides actionable auditing tools for data leakage detection and offers insights into how model size and instruction tuning influence memorization, informing mitigation strategies and policy considerations for safer LLM deployment.

Abstract

Large language models, trained on massive corpora, are prone to verbatim memorization of training data, creating significant privacy and copyright risks. While previous works have proposed various definitions for memorization, many exhibit shortcomings in comprehensively capturing this phenomenon, especially in aligned models. To address this, we introduce a novel framework: multi-prefix memorization. Our core insight is that memorized sequences are deeply encoded and thus retrievable via a significantly larger number of distinct prefixes than non-memorized content. We formalize this by defining a sequence as memorized if an external adversarial search can identify a target count of distinct prefixes that elicit it. This framework shifts the focus from single-path extraction to quantifying the robustness of a memory, measured by the diversity of its retrieval paths. Through experiments on open-source and aligned chat models, we demonstrate that our multi-prefix definition reliably distinguishes memorized from non-memorized data, providing a robust and practical tool for auditing data leakage in LLMs.
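
The multi-prefix decision rule can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the adversarial search (GCG in the paper) is abstracted as any iterable of candidate prefixes, `generate` is a hypothetical callable that wraps the model's decoding, the exact-match check on the continuation is an assumption, and the required prefix count is passed in directly because the mapping from the memorization score to that count is not reproduced here.

```python
from typing import Callable, Iterable, Set


def is_multi_prefix_memorized(
    target: str,
    candidate_prefixes: Iterable[str],
    generate: Callable[[str], str],
    required_count: int,
) -> bool:
    """Return True once `required_count` distinct prefixes elicit `target`.

    `candidate_prefixes` stands in for the output of an adversarial search;
    `generate` is a hypothetical wrapper around the model (prefix -> continuation).
    """
    if required_count <= 0:
        raise ValueError("required_count must be positive")

    successful: Set[str] = set()  # distinct prefixes that elicited the target
    for prefix in candidate_prefixes:
        if prefix in successful:
            continue  # duplicates do not count as a new retrieval path
        # Assumption: "elicits" means the continuation starts with the target verbatim.
        if generate(prefix).startswith(target):
            successful.add(prefix)
            if len(successful) >= required_count:
                return True  # early stop: the target count is reached
    return False
```

Stopping as soon as the target count is reached keeps the search cost bounded for sequences that are easy to elicit, which is the role early stopping plays in the framework.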

Paper Structure

This paper contains 32 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Discoverable memorization rates for Pythia on Famous Quotes. The ratio reflects quotes elicited via prefix completion. Contrary to expectations, memorization does not scale with model size and declines for the largest models. The x-axis shows the Pythia model size.
  • Figure 2: GCG success rates by model size and sequence type. Memorized sequences consistently yield higher success rates, validating the classification method and revealing the attack's dependence on prior memorization.
  • Figure 3: Memorization behavior in Pythia-6.9B for original Famous Quotes and paraphrased quotes. Top: Memorization rates for original vs. paraphrased quotes. Minimal paraphrasing drops memorization from 84% to 3.2%, showing the method detects verbatim memorization. Bottom: Distribution of recall rates. Adversarial attacks on paraphrased quotes often trigger the model to recall and output the original memorized quote.
  • Figure 4: Semantic similarity statistics for adversarial prefix generation targeting famous quotes. Top: Distribution of cosine distances between adversarial prefixes of each target (Famous Quotes). Bottom: Distribution of cosine similarities between original quotes and their adversarial prefixes.
  • Figure 5: Proportion of sequences from the Famous Quotes dataset classified as memorized by each Pythia model. The results demonstrate a clear trend of increased memorization with model size. The x-axis signifies the model size.
  • ...and 3 more figures
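
The prefix-diversity statistic in Figure 4 (top) rests on pairwise cosine distances between the prefixes found for one target. The sketch below shows that computation on plain Python lists. It assumes each prefix has already been mapped to an embedding vector by some sentence encoder; the choice of encoder and the function names are illustrative assumptions, not details taken from the paper.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance, defined as 1 minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_cosine_distances(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Distances for every unordered pair of prefix embeddings of one target."""
    return [cosine_distance(u, v) for u, v in combinations(embeddings, 2)]
```

Distances that cluster near 1 would indicate that the successful prefixes are semantically unrelated to one another, which is consistent with the lookup-table-like retrieval pattern described in the TL;DR.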