The Mosaic Memory of Large Language Models

Igor Shilov, Matthieu Meeus, Yves-Alexandre de Montjoye

TL;DR

This work reframes LLM memorization as mosaic memory, where models memorize across fuzzy duplicates rather than only exact training repetitions. By injecting synthetic reference canaries and measuring Membership Inference Attack performance, it defines the exact duplicate equivalent $\rho$ to compare memorization from fuzzy versus exact copies. Across multiple model families, fuzzy duplicates contribute substantial memorization, mostly through syntactic token overlap, and this mosaic memory remains robust to insertions, shuffling, and paraphrase-based perturbations. The study also shows that real-world datasets like SlimPajama harbor extensive fuzzy duplicates despite deduplication, implying privacy and benchmarking challenges and motivating more nuanced deduplication and benchmark decontamination strategies.

Abstract

As Large Language Models (LLMs) become widely adopted, understanding how they learn from, and memorize, training data becomes crucial. Memorization in LLMs is widely assumed to occur only as a result of sequences being repeated in the training data. Instead, we show that LLMs memorize by assembling information from similar sequences, a phenomenon we call mosaic memory. We show major LLMs to exhibit mosaic memory, with fuzzy duplicates contributing to memorization as much as 0.8 of an exact duplicate and even heavily modified sequences contributing substantially to memorization. Although models display reasoning capabilities, we somewhat surprisingly show memorization to be predominantly syntactic rather than semantic. We finally show fuzzy duplicates to be ubiquitous in real-world data, untouched by deduplication techniques. Taken together, our results challenge widely held beliefs and show memorization to be a more complex, mosaic process, with real-world implications for privacy, confidentiality, model utility and evaluation.

Paper Structure

This paper contains 14 sections, 7 equations, 14 figures, and 7 tables.

Figures (14)

  • Figure 1: LLMs have a mosaic memory. The exact duplicate equivalent $\rho$ for fuzzy duplicates as a function of the number of replacements $R$. For smaller values of $R$, fuzzy duplicates contribute to memorization almost as much as exact duplicates ($\rho=1$), while for larger values of $R$, memorization remains significantly higher than if the canary were entirely absent from the training dataset ($\rho>0$). Mosaic memory is present across widely used models: GPT-NEO 1.3B [gpt-neo], Gemma-2B [team2024gemma], Phi-2 [javaheripi2023phi] and Llama-3.2-1B [dubey2024llama].
  • Figure 2: Memorization of fuzzy duplicates constructed with $\mathcal{A}_{\text{insert}}$ and $\mathcal{A}_{\text{shuffle}}$. (a) The exact duplicate equivalent $\rho$ for fuzzy duplicates when $n$-grams are separated by $X_{\text{insert}}$ tokens ($\mathcal{A}_{\text{insert}}$). Results demonstrate the model's ability to recognize and memorize content fragments despite the presence of varying amounts of noise tokens inserted between meaningful chunks. Different values of $X_{\text{insert}}$ represent different numbers of random tokens inserted between each $n$-gram, with $X_{\text{insert}} = \infty$ representing the baseline case where $n$-grams are randomly scattered throughout the training dataset. (b) The exact duplicate equivalent $\rho$ for fuzzy duplicates obtained by shuffling $n$-grams ($\mathcal{A}_{\text{shuffle}}$). Results illustrate the impact of token reordering on memorization, with Kendall-Tau distance ($\tau$) measuring the degree of permutation between token pairs. Higher $\tau$ values indicate greater departure from the original sequence order. Different $n$ values represent different sizes of $n$-grams kept intact while their positions were shuffled. Dashed lines show baseline memorization levels for each $n$ value when $n$-grams are randomly scattered ($X_{\text{insert}} = \infty$).
  • Figure 3: Mosaic memory for varying levels of semantic coherence across fuzzy duplicates. The exact duplicate equivalent $\rho$ for fuzzy duplicates when tokens are replaced with a token sampled from the top $k$ predictions returned by the masked language model $\textit{MLM}$. When $k=10$, tokens are replaced by one of the $10$ most likely tokens predicted by $\textit{MLM}$, while for $k=\mathcal{V}_{\textit{MLM}}$, tokens are effectively replaced by a random token from the MLM's vocabulary.
  • Figure 4: Fuzzy duplicates in SlimPajama. (a) Correlation between Levenshtein distance and the exact duplicate equivalent $\rho$ based on the experimental results reported for $\mathcal{A}_{\text{replace}}$, $\mathcal{A}_{\text{insert}}$ and $\mathcal{A}_{\text{shuffle}}$. (b) Number of fuzzy duplicates (cumulative) found in SlimPajama with increasing Hamming and Levenshtein distance from the original sequence. Reported numbers are averaged over 100 sequences from a subgroup of sequences repeated verbatim $1{,}000$ ($\pm 1\%$) times in the dataset.
  • Figure 5: The number of fuzzy duplicates in SlimPajama [cerebras2023slimpajama] under varying levels of deduplication. $n$-gram deduplication strategies for varying $n$ fail to account for a range of fuzzy duplicates that still contribute substantially to memorization.
  • ...and 9 more figures
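
The figure captions above refer to three token-level distances: Hamming and Levenshtein distance (Figure 4) and Kendall-Tau distance (Figure 2b), as well as to fuzzy duplicates built by replacing $R$ tokens (Figure 1). The following is a minimal Python sketch of those quantities, not the authors' implementation. It assumes sequences are lists of integer token IDs, that replacement tokens are drawn uniformly from a vocabulary of hypothetical size `vocab_size`, and that the Kendall-Tau distance is normalized to the fraction of discordant pairs; the function names are illustrative only.

```python
import random
from itertools import combinations


def replace_tokens(tokens, num_replacements, vocab_size, rng):
    """Copy `tokens` and resample `num_replacements` distinct positions.

    Assumption: replacements are uniform over the vocabulary, which is
    only one possible way to build a fuzzy duplicate.
    """
    fuzzy = list(tokens)
    for pos in rng.sample(range(len(tokens)), num_replacements):
        new_token = rng.randrange(vocab_size)
        # Redraw until the token differs, so each chosen position changes.
        while new_token == tokens[pos]:
            new_token = rng.randrange(vocab_size)
        fuzzy[pos] = new_token
    return fuzzy


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def kendall_tau_distance(perm):
    """Fraction of index pairs that `perm` puts in reversed order (0 to 1)."""
    n = len(perm)
    if n < 2:
        return 0.0
    discordant = sum(1 for i, j in combinations(range(n), 2)
                     if perm[i] > perm[j])
    return discordant / (n * (n - 1) / 2)


if __name__ == "__main__":
    rng = random.Random(0)
    canary = [rng.randrange(50_000) for _ in range(100)]
    fuzzy = replace_tokens(canary, num_replacements=10,
                           vocab_size=50_000, rng=rng)
    print("Hamming:", hamming(canary, fuzzy))
    print("Levenshtein:", levenshtein(canary, fuzzy))
    print("Kendall-Tau of a full reversal:", kendall_tau_distance([3, 2, 1, 0]))
```

With pure replacements the Hamming distance equals the number of replaced positions and the Levenshtein distance cannot exceed it. Insertions and shuffles change token alignment, so only the Levenshtein distance stays meaningful for them.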