A Multi-Perspective Analysis of Memorization in Large Language Models

Bowen Chen; Namgi Han; Yusuke Miyao

TL;DR

The paper investigates memorization in large language models (LLMs) from multiple angles to uncover how, when, and why memorized content emerges. It introduces a formal memorization criterion $M(X,Y)$ and a prediction framework using token- and sentence-level metrics, across a range of Pythia models from 70M to 12B parameters. Key findings include non-linear memorization scaling with model size and context, boundary effects in input and decoding dynamics, embedding-space clustering indicating paraphrase memorization, and the feasibility of predicting memorization with a Transformer. These results advance understanding of memorization mechanics and have implications for privacy, data contamination, and safer LLM deployment through improved anticipation of memorized content.

Abstract

Large Language Models (LLMs), which have billions of parameters and are trained on massive corpora, show unprecedented performance in various fields. Alongside this performance, researchers have noticed some unusual behaviors in these models. One of them is memorization, in which an LLM generates the same content that was used to train it. Although previous research has discussed memorization, it still lacks explanation, especially regarding its cause and the dynamics of generating memorized content. In this research, we discuss memorization comprehensively from various perspectives and extend the scope beyond memorized content to less memorized and unmemorized content. We found that: (1) Experiments reveal how memorization relates to model size, continuation size, and context size, and show how unmemorized sentences transition to memorized sentences. (2) Embedding analysis shows the distribution and decoding dynamics in embedding space, across model sizes, for sentences with different memorization scores. (3) An analysis of n-gram statistics and entropy decoding dynamics reveals a boundary effect when the model starts to generate memorized or unmemorized sentences. (4) We trained a Transformer model to predict the memorization of different models, showing that memorization can be predicted from context.
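To make the idea of a memorization score concrete, the following minimal sketch compares a model's generated continuation with the true continuation token by token. It is an illustration under stated assumptions, not the paper's exact criterion: the function names `memorization_score` and `memorized_prefix_length` are hypothetical, and the score is assumed here to be the fraction of positions where the generated token matches the reference.

```python
from typing import Sequence


def memorization_score(generated: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of reference positions reproduced exactly by the model.

    Assumption: both arguments are token-ID sequences for the continuation
    only (the context is excluded), aligned position by position.
    """
    if not reference:
        raise ValueError("reference continuation must be non-empty")
    # Count positions where the generated token equals the true token.
    matches = sum(1 for g, r in zip(generated, reference) if g == r)
    return matches / len(reference)


def memorized_prefix_length(generated: Sequence[int], reference: Sequence[int]) -> int:
    """Number of leading tokens that match before the first mismatch."""
    count = 0
    for g, r in zip(generated, reference):
        if g != r:
            break
        count += 1
    return count


if __name__ == "__main__":
    true_continuation = [11, 42, 7, 99]
    model_output = [11, 42, 0, 99]
    print(memorization_score(model_output, true_continuation))
    print(memorized_prefix_length(model_output, true_continuation))
```

Under this assumed definition, a score of 1.0 corresponds to a fully memorized continuation and a score of 0.0 to a fully unmemorized one, with less memorized content falling in between.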

Paper Structure (31 sections, 2 equations, 18 figures, 6 tables)

Introduction
Related Works
Scaling Laws of LLM
Memorization
Experiment Setting
Criteria
Memorization Criteria
Prediction Criteria
Model Setting
Experiment Results
Memorization Factors
The Factor of Model Size
Context and Complement Size
Memorization Transition
Input Dynamics
...and 16 more sections

Figures (18)

Figure 1: Memorization and Research Scope in this study
Figure 2: Memorization Statistics Across Model Size, Complement Size, and Context Size
Figure 3: Transition Across Different Model Size
Figure 4: One-gram Analysis at Each Index
Figure 5: Embedding Dynamics Across Different Model Sizes. Memorized Token Num $x$ means that $x$ generated tokens are the same as the true continuation.
...and 13 more figures
