The Unreasonable Ineffectiveness of Nucleus Sampling on Mitigating Text Memorization

Luka Borec, Philipp Sadler, David Schlangen

TL;DR

This work interrogates whether nucleus sampling, via the top_p parameter, can mitigate text memorization in large language models. Using OpenMemText, a controlled OpenWebText-derived diagnostic dataset with deliberate duplication, and fine-tuned GPT-Neo models, the authors evaluate memorization under greedy decoding and under nucleus sampling across top_p values. They find that memorization correlates strongly with data duplication and model size, and that increasing top_p yields only modest reductions in memorized outputs; in many cases, high-frequency memorized tokens drive deterministic selections, undermining the intended effect of stochastic decoding. The study also identifies ramp-up and saturation points, and introduces the notion of soft memorization evidenced by BLEU-4 correlations, highlighting that outputs can closely resemble training data even when not exact copies. Overall, the work suggests that decoding strategy alone may be insufficient to curb memorization risks and motivates further investigation into alternative mitigation methods and scaling effects.
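
To make the point about deterministic selections concrete, here is a minimal sketch of nucleus (top-p) filtering in plain Python. It is not the authors' code: the function names `nucleus` and `sample_top_p` and the two toy distributions are assumptions made for illustration. It shows that when a single token already holds more probability mass than top_p, the nucleus contains only that token, so sampling behaves like greedy decoding.

```python
import random

def nucleus(probs, top_p):
    """Return the smallest set of token ids, taken in descending
    probability order, whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

def sample_top_p(probs, top_p, rng=random):
    """Sample one token id from the nucleus, weighted by probability."""
    kept = nucleus(probs, top_p)
    weights = [probs[i] for i in kept]  # random.choices normalizes weights
    return rng.choices(kept, weights=weights, k=1)[0]

# Hypothetical next-token distributions over a 4-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]  # a memorized continuation dominates
flat = [0.40, 0.30, 0.20, 0.10]    # no single token dominates

print(nucleus(peaked, 0.8))  # [0]
print(nucleus(flat, 0.8))    # [0, 1, 2]
```

With the peaked distribution, every call to `sample_top_p(peaked, 0.8)` returns token 0, and the same holds for any top_p up to 0.97. This is consistent with the modest effect of raising top_p that the summary reports.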

Abstract

This work analyses the text memorization behavior of large language models (LLMs) when subjected to nucleus sampling. Stochastic decoding methods like nucleus sampling are typically applied to overcome issues such as monotonous and repetitive text generation, which are often observed with maximization-based decoding techniques. We hypothesize that nucleus sampling might also reduce the occurrence of memorization patterns, because it could lead to the selection of tokens outside the memorized sequence. To test this hypothesis we create a diagnostic dataset with a known distribution of duplicates that gives us some control over the likelihood of memorization of certain parts of the training data. Our analysis of two GPT-Neo models fine-tuned on this dataset interestingly shows that (i) an increase of the nucleus size reduces memorization only modestly, and (ii) even when models do not engage in "hard" memorization -- a verbatim reproduction of training samples -- they may still display "soft" memorization whereby they generate outputs that echo the training data but without a complete one-by-one resemblance.
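
The distinction between "hard" and "soft" memorization can be sketched as follows. This is an illustrative approximation, not the paper's evaluation code: the paper reports BLEU-4, whereas the hypothetical helper `ngram_precision` below computes only a clipped 4-gram precision as a simplified stand-in, and the example sentences are invented.

```python
from collections import Counter

def is_hard_memorized(generated, reference):
    """Hard memorization: the generated tokens match the training
    continuation verbatim."""
    return generated == reference

def ngram_precision(generated, reference, n=4):
    """Fraction of generated n-grams that also occur in the reference
    (clipped counts). A simplified stand-in for BLEU-4, not BLEU itself."""
    gen = Counter(tuple(generated[i:i + n]) for i in range(len(generated) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(gen.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in gen.items())
    return overlap / total

# Invented example: a training continuation and two model outputs.
reference = "the quick brown fox jumps over the lazy dog".split()
verbatim = list(reference)
near_copy = "the quick brown fox jumps over a lazy dog".split()

print(is_hard_memorized(verbatim, reference))   # True
print(is_hard_memorized(near_copy, reference))  # False
print(ngram_precision(near_copy, reference))    # 0.5
```

The near copy fails the exact-match test, yet half of its 4-grams still come straight from the reference. An exact-match metric would miss this kind of overlap.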

Paper Structure (26 sections, 1 equation, 7 figures, 3 tables)

Figures (7)

  • Figure 1: The effect of different top_p values (x-axis) on the fraction of the duplicated texts memorized by the models (y-axis). The top_p parameter sets the maximum cumulative probability mass considered when selecting the output token during nucleus sampling. Higher top_p values generally lead to reduced memorization, yet the decrease is smaller than expected. This effect is observed across two models of different sizes, with the larger model showing a somewhat less pronounced reduction in memorization than the smaller model. The dashed lines show the baseline behavior using greedy decoding.
  • Figure 2: During fine-tuning, we measure a consistent decrease in both training and validation loss, which indicates that the GPT-Neo models fit the memorization dataset better over time.
  • Figure 3: Results from our replication of the quantifying-memorization study (Carlini et al.). The two fine-tuned GPT-Neo models were compared to non-fine-tuned GPT-2 models of similar sizes using the same prompts. (a) The larger model memorized more of the training dataset than the smaller one. (b) Repeated data in the training set is more likely to be extractable. (c) There is a gradual increase in the extraction of memorized text as the length of input context increases.
  • Figure 4: Heatmap illustrating the inverse relationship between top_p parameter values and extracted memorized text, modulated by the number of data repetitions in steps of five. It highlights the unexpected trend that for a high number of data copies, memorization levels remain significant for all top_p values, while fewer data repetitions lead to markedly lower memorization when top_p is increased, reflecting the models' shift from rote memory to learned generalizations.
  • Figure 5: This more fine-grained view of the range between $15$ and $20$ data copies delineates the ramp-up point, where memorization begins to climb sharply, and the approach to the saturation point, where adding further copies has a diminishing effect on memorization rates. It illustrates that, although increasing top_p typically reduces memorization, high repetition still results in substantial memorization, particularly in the GPT-Neo $350$M model.
  • ...and 2 more figures