How much do language models memorize?

John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexander M. Rush, Kamalika Chaudhuri, Saeed Mahloujifar

TL;DR

This paper introduces a principled framework for quantifying how much language models memorize by separating unintended memorization from generalization. It uses definitions based on Kolmogorov complexity, approximated in practice by compression with arithmetic coding, to estimate memorization capacity, and finds that GPT-style transformers store about $3.6$ bits per parameter. Models memorize until this capacity fills; beyond that point "grokking" begins and unintended memorization decreases. Through synthetic and real-text experiments, the paper shows that double descent occurs when dataset size exceeds model capacity, and it derives scaling laws that predict membership-inference performance from model capacity and data size. The work offers practical insight into memorization, generalization, and privacy in large transformers, with implications for data curation and model evaluation.
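The compression view mentioned above can be illustrated with a minimal sketch. It rests on one standard fact: an arithmetic coder driven by a model's token probabilities needs roughly $-\sum_t \log_2 p_t$ bits to encode a sequence. Everything else here is an assumption made for illustration and not the paper's exact estimator: the function names are hypothetical, the probabilities are made up, and unintended memorization is simplified to the bits saved by the trained model relative to a reference model, clipped at zero.

```python
import math

def code_length_bits(token_probs):
    """Ideal code length in bits for a sequence, given the probability
    a model assigned to each observed token: -sum(log2 p)."""
    return -sum(math.log2(p) for p in token_probs)

def unintended_memorization_bits(target_probs, reference_probs):
    """Simplified estimate (an assumption, not the paper's estimator):
    bits saved when encoding with the trained target model instead of
    a reference model, clipped at zero."""
    reference_bits = code_length_bits(reference_probs)
    target_bits = code_length_bits(target_probs)
    return max(0.0, reference_bits - target_bits)

# Made-up per-token probabilities for a two-token sequence.
target = [0.5, 0.5]        # trained model is fairly confident
reference = [0.25, 0.25]   # reference model is less confident
print(code_length_bits(target))
print(unintended_memorization_bits(target, reference))
```

A sequence that the target model compresses no better than the reference contributes zero under this simplification, which matches the intuition that such content is explained by generalization, not by memorizing the specific sample.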

Abstract

We propose a new method for estimating how much a model knows about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have struggled to disentangle memorization from generalization. We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point "grokking" begins, and unintended memorization decreases as models begin to generalize. We train hundreds of transformer language models ranging from $500K$ to $1.5B$ parameters and produce a series of scaling laws relating model capacity and data size to membership inference.
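As a rough worked example of the capacity figure, the sketch below multiplies a parameter count by the reported estimate of approximately 3.6 bits per parameter. The function names and the linear extrapolation across model sizes are assumptions for illustration; only the 3.6 figure and the $500K$ to $1.5B$ parameter range come from the abstract.

```python
BITS_PER_PARAM = 3.6  # approximate capacity estimate reported in the abstract

def estimated_capacity_bits(n_params, bits_per_param=BITS_PER_PARAM):
    """Estimated memorization capacity in bits, assuming capacity
    scales linearly with parameter count (a simplifying assumption)."""
    return n_params * bits_per_param

def exceeds_capacity(dataset_bits, n_params, bits_per_param=BITS_PER_PARAM):
    """True if a dataset's information content is at least the
    estimated capacity of a model with n_params parameters."""
    return dataset_bits >= estimated_capacity_bits(n_params, bits_per_param)

# Smallest and largest model sizes mentioned in the abstract.
for n_params in (500_000, 1_500_000_000):
    bits = estimated_capacity_bits(n_params)
    print(f"{n_params:>13,} params -> about {bits / 8e6:,.2f} megabytes")
```

Under this assumption the smallest model holds roughly 1.8 million bits and the largest roughly 5.4 billion bits, which gives a sense of the dataset sizes at which capacity would fill.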

Paper Structure

This paper contains 47 sections, 3 theorems, 16 equations, 17 figures, 5 tables.

Key Result

Proposition 1

Assume $X=(X_1,\dots, X_n)$ is a dataset of $n$ i.i.d. samples. We have

Figures (17)

  • Figure 1: Unintended memorization of uniform random data (Section \ref{sec:um-synthetic}). Memorization plateaus at the empirical capacity limit of different-sized models from the GPT-family, approximately 3.6 bits-per-parameter.
  • Figure 2: Unintended memorization of text across model and dataset sizes (Section \ref{sec:um-text}). All quantities are calculated with respect to a large oracle model trained on the full data distribution.
  • Figure 3: In our experiments on synthetic bitstrings, double descent occurs exactly when the dataset size begins to exceed the model's capacity, when unintended memorization is no longer beneficial for lowering the loss.
  • Figure 4: Train and test losses of different model and dataset sizes trained on text. Double descent occurs when dataset size exceeds model capacity.
  • Figure 5: Bits memorized across training. This particular model is a GPT-style transformer with $6.86M$ parameters and a capacity of $23.9$ Mb (megabits, about $3.5$ bits per parameter).
  • ...and 12 more figures

Theorems & Definitions (8)

  • Proposition 1: Super-additivity of Unintended Memorization
  • Definition 2: Kolmogorov complexity
  • Definition 3: Kolmogorov memorization
  • Proposition 4
  • Definition 5: Capacity
  • proof
  • proof
  • Lemma 6