Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Yang Wang

TL;DR

This work defines distributional memorization and generalization and introduces a task-gram language model, built from semantically aligned input-output $n$-grams, to approximate pretraining data distributions. Using Pythia models trained on the Pile, it analyzes four tasks (translation, factual QA, world knowledge, and math reasoning) and shows that memorization dominates knowledge-intensive tasks (notably TriviaQA) while generalization is key for reasoning tasks (GSM8K/MMLU). The study further demonstrates that a task-gram LM explains LLM predictions better than an $\infty$-gram baseline, and that memorization increases with model size only for factual question answering. It also explores practical implications, including prompt optimization driven by distributional signals, and provides a scalable framework for probing pretraining data contributions across diverse tasks.

Abstract

The impressive capabilities of large language models (LLMs) have sparked debate over whether these models genuinely generalize to unseen tasks or predominantly rely on memorizing vast amounts of pretraining data. To explore this issue, we introduce an extended concept of memorization, distributional memorization, which measures the correlation between the LLM output probabilities and the pretraining data frequency. To effectively capture task-specific pretraining data frequency, we propose a novel task-gram language model, which is built by counting the co-occurrence of semantically related $n$-gram pairs from task inputs and outputs in the pretraining corpus. Using the Pythia models trained on the Pile dataset, we evaluate four distinct tasks: machine translation, factual question answering, world knowledge understanding, and math reasoning. Our findings reveal varying levels of memorization, with the strongest effect observed in factual question answering. Furthermore, while model performance improves across all tasks as LLM size increases, only factual question answering shows an increase in memorization, whereas machine translation and reasoning tasks exhibit greater generalization, producing more novel outputs. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks, providing a scalable method for analyzing large pretraining corpora in greater depth.
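
As a rough illustration of the idea in the abstract, the sketch below computes a Spearman rank correlation between two lists of scores: one standing in for task-gram LM log-probabilities (pretraining data frequency) and one standing in for LLM log-probabilities on the same test examples. The function names `average_ranks` and `spearman` and all numbers are hypothetical, chosen only for illustration; this is not the authors' implementation, and their exact definition may differ in detail.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Made-up per-example scores (assumption: one score per test example).
task_gram_logprob = [-2.0, -5.0, -1.0, -7.0]  # proxy for pretraining frequency
llm_logprob = [-1.5, -4.0, -0.5, -6.0]        # LLM probability of the output

rho = spearman(task_gram_logprob, llm_logprob)
print(f"Spearman rho = {rho:.3f}")
```

Under this reading, a strongly positive correlation would suggest that the model's confidence tracks how often the relevant $n$-gram pairs occur in the pretraining corpus, while a weak or insignificant correlation would point toward generalization.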

Paper Structure (27 sections, 11 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Overview of our proposed analysis pipeline. For the selected evaluation tasks, we first construct a task-gram table by matching semantically similar $n$-grams from task inputs ($x$) and targets ($y$). These $n$-grams are then searched within the pretraining corpus, yielding their counts and source documents. We then build a task-gram language model from the obtained counts and analyze its relationship with LLM predictions.
  • Figure 2: Expected distributional memorization and generalization trend for different types of tasks.
  • Figure 3: Task performance vs. $n$-gram pair count in the Pile with different Pythia model sizes, for four different tasks, from left to right: WMT, TriviaQA, MMLU, and GSM8K.
  • Figure 4: Visualization of distributional memorization with different-sized Pythia models on four tasks: WMT, TriviaQA, MMLU, and GSM8K. We also show results with OLMo models on GSM8K. For WMT, we show the number of new $n$-gram pairs generated by LLMs, as the distributional memorization is not significant. For MMLU, we divide the tasks into two categories: knowledge-intensive and reasoning-intensive. For GSM8K, we show the Kendall tau ranking distance instead of Spearman correlation to quantify the distributional generalization effect, as the distributional memorization is not significant. Solid lines show distributional memorization computed with our task-gram LM, and dashed lines are computed with the $\infty$-gram LM. Statistically significant values ($p < 0.05$) are marked with solid round markers, while statistically insignificant values ($p > 0.05$) are marked with gray star markers.
  • Figure 5: Training influence of pretraining documents vs. Pythia model size with WMT, TriviaQA, and MMLU. Green lines correspond to documents containing $n$-gram pairs, while blue lines correspond to documents containing only the output $n$-gram in $n$-gram pairs.
  • ...and 7 more figures
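
The counting step of the pipeline described in Figure 1 might be sketched as follows. This is a toy approximation under several assumptions: the semantically aligned $n$-gram pairs are taken as already given (the matching step is not shown), co-occurrence is checked by plain lowercase substring search over whole documents, and the conditional probability is a simple normalized count. The names `cooccurrence_counts` and `conditional_probs`, the documents, and the pairs are all hypothetical.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(documents, pairs):
    """Count documents in which both n-grams of a pair appear.

    Assumption: pairs are lowercase strings and matching is a plain
    substring test; a real corpus search would use an index instead.
    """
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        for x_gram, y_gram in pairs:
            if x_gram in text and y_gram in text:
                counts[(x_gram, y_gram)] += 1
    return counts

def conditional_probs(counts):
    """Normalize pair counts into P(y_gram | x_gram) over the observed pairs."""
    totals = defaultdict(int)
    for (x_gram, _), c in counts.items():
        totals[x_gram] += c
    return {(x, y): c / totals[x] for (x, y), c in counts.items()}

# Toy stand-in for a pretraining corpus.
documents = [
    "The capital of France is Paris.",
    "Paris is the capital of France and its largest city.",
    "Some say the capital of France is Lyon, which is wrong.",
]
# Assumed output of the semantic matching step: (input n-gram, output n-gram).
pairs = [("capital of france", "paris"), ("capital of france", "lyon")]

counts = cooccurrence_counts(documents, pairs)
probs = conditional_probs(counts)
for pair, p in sorted(probs.items()):
    print(pair, round(p, 3))
```

Scores of this kind, computed per test example, are what one would then compare against LLM output probabilities.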

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3