The Unreasonable Ineffectiveness of Nucleus Sampling on Mitigating Text Memorization
Luka Borec, Philipp Sadler, David Schlangen
TL;DR
This work examines whether nucleus sampling, via the top_p parameter, can mitigate text memorization in large language models. Using OpenMemText, a controlled OpenWebText-derived diagnostic dataset with deliberate duplication, and fine-tuned GPT-Neo models, the authors evaluate memorization under greedy decoding and under nucleus sampling across top_p values. They find that memorization correlates strongly with data duplication and model size, and that increasing top_p yields only modest reductions in memorized outputs; in many cases, high-frequency memorized tokens drive deterministic selections, undermining the intended effect of stochastic decoding. The study also identifies ramp-up and saturation points, and introduces the notion of soft memorization evidenced by BLEU-4 correlations, highlighting that outputs can closely resemble training data even when not exact copies. Overall, the work suggests that decoding strategy alone may be insufficient to curb memorization risks and motivates further investigation into alternative mitigation methods and scaling effects.
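To illustrate why stochastic decoding can still behave deterministically, here is a minimal sketch of nucleus (top-p) filtering on toy next-token distributions. The function names and the probability values are hypothetical and chosen only for illustration; this is not the authors' implementation. When one token holds almost all of the probability mass, as might happen inside a memorized sequence, the nucleus can shrink to that single token, so sampling has nothing else to choose from.

```python
import random


def nucleus(probs, top_p):
    """Return the smallest set of token ids, ranked by probability,
    whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return kept


def sample_nucleus(probs, top_p, rng):
    """Sample one token id from the nucleus, weighted by the original
    probabilities (random.choices normalizes the weights)."""
    kept = nucleus(probs, top_p)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Hypothetical distributions: one sharply peaked, one comparatively flat.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.4, 0.3, 0.2, 0.1]

rng = random.Random(0)
peaked_nucleus = nucleus(peaked, 0.9)   # only the dominant token survives
flat_nucleus = nucleus(flat, 0.8)       # several tokens remain candidates
peaked_draws = {sample_nucleus(peaked, 0.9, rng) for _ in range(100)}
```

In this toy setting, `peaked_draws` contains only token 0 regardless of the random seed, which mirrors the behaviour of greedy decoding even though sampling is nominally enabled.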
Abstract
This work analyses the text memorization behavior of large language models (LLMs) when subjected to nucleus sampling. Stochastic decoding methods like nucleus sampling are typically applied to overcome issues such as monotonous and repetitive text generation, which are often observed with maximization-based decoding techniques. We hypothesize that nucleus sampling might also reduce the occurrence of memorization patterns, because it could lead to the selection of tokens outside the memorized sequence. To test this hypothesis we create a diagnostic dataset with a known distribution of duplicates that gives us some control over the likelihood of memorization of certain parts of the training data. Our analysis of two GPT-Neo models fine-tuned on this dataset shows that (i) an increase of the nucleus size reduces memorization only modestly, and (ii) even when models do not engage in "hard" memorization -- a verbatim reproduction of training samples -- they may still display "soft" memorization, whereby they generate outputs that echo the training data without a complete one-to-one resemblance.
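The distinction between "hard" and "soft" memorization can be sketched with a toy comparison of a generated continuation against a training sample. The sketch below assumes that hard memorization means an exact token-level match and uses a clipped 4-gram precision as a simplified stand-in for an overlap metric such as BLEU-4; it omits BLEU's brevity penalty and lower-order n-grams, and the helper names and example sentences are hypothetical rather than taken from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def is_hard_memorized(generated, reference):
    """Assumed definition: the output reproduces the reference verbatim."""
    return generated == reference


def four_gram_precision(generated, reference):
    """Fraction of generated 4-grams that also occur in the reference,
    with counts clipped to the reference counts (as in BLEU's modified
    precision). This is a simplified proxy, not full BLEU-4."""
    gen_counts = Counter(ngrams(generated, 4))
    ref_counts = Counter(ngrams(reference, 4))
    total = sum(gen_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in gen_counts.items())
    return matched / total


# Hypothetical training sample and a near-copy that differs in one token.
reference = "the quick brown fox jumps over the lazy dog".split()
generated = "the quick brown fox leaps over the lazy dog".split()

hard = is_hard_memorized(generated, reference)
overlap = four_gram_precision(generated, reference)
```

Here the single substituted token means the output is not a verbatim copy, yet a share of its 4-grams still coincides with the training sample, which is the kind of partial resemblance the abstract describes as soft memorization.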
