Scaling Laws for Fact Memorization of Large Language Models

Xingyu Lu; Xiaonan Li; Qinyuan Cheng; Kai Ding; Xuanjing Huang; Xipeng Qiu

TL;DR

This paper quantitatively analyzes how large language models memorize real-world atomic facts, represented as (key, attribute, value) triples, and introduces a memorization-rate metric. It finds that fact capacity scales linearly with model size and follows a negative exponential law in the number of training epochs, so capacity saturates as training continues. Under the fitted law, memorizing a comprehensive public fact base such as Wikidata would demand extreme compute. The study also finds that LLMs memorize redundant or derivable facts inefficiently, that memorization preference is driven by fact frequency and difficulty, and that facts learned later can overwrite facts learned earlier. Finally, LLMs can generalize to unseen facts under a scaling law similar to that of general pre-training, and generalization correlates with memorization strength. This suggests selectively relying on generalizable facts and supplementing model memory with retrieval-augmented approaches for factual reliability.
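
As a rough illustration of what a memorization rate over (key, attribute, value) triples could look like, the sketch below scores a model by exact match on the value. The exact-match criterion, the `memorization_rate` and `toy_model` names, and the toy facts are all assumptions made for illustration; the summary above does not give the paper's precise definition.

```python
from typing import Callable, Iterable, Tuple

# A fact is a (key, attribute, value) triple, as described in the summary.
Triple = Tuple[str, str, str]


def memorization_rate(
    facts: Iterable[Triple],
    predict: Callable[[str, str], str],
) -> float:
    """Fraction of facts whose value is reproduced exactly.

    Assumption: a fact counts as memorized only if the predicted value
    string equals the reference value string.
    """
    facts = list(facts)
    if not facts:
        return 0.0
    correct = sum(1 for key, attr, value in facts if predict(key, attr) == value)
    return correct / len(facts)


# Toy stand-in for a trained model: a lookup table missing one fact.
_known = {
    ("France", "capital"): "Paris",
    ("Japan", "capital"): "Tokyo",
    ("Canada", "capital"): "Ottawa",
}


def toy_model(key: str, attr: str) -> str:
    # Returns an empty string for facts the toy model has not "memorized".
    return _known.get((key, attr), "")


toy_facts = [
    ("France", "capital", "Paris"),
    ("Japan", "capital", "Tokyo"),
    ("Canada", "capital", "Ottawa"),
    ("Australia", "capital", "Canberra"),
]

rate = memorization_rate(toy_facts, toy_model)
print(rate)  # 3 of 4 toy facts match
```

In practice `predict` would wrap a model's generation for a prompt built from the key and attribute; the lookup table here only keeps the example self-contained.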

Abstract

Fact knowledge memorization is crucial for Large Language Models (LLMs) to generate factual and reliable responses. However, the behavior of LLM fact memorization remains under-explored. In this paper, we analyze the scaling laws for LLMs' fact knowledge and LLMs' behavior when memorizing different types of facts. We find that LLMs' fact knowledge capacity follows a linear law with respect to model size and a negative exponential law with respect to training epochs. According to the fitted scaling law, memorizing all of Wikidata's facts requires training an LLM with 1000B non-embed parameters for 100 epochs, suggesting that using LLMs to memorize all public facts is almost implausible in a general pre-training setting. Meanwhile, we find that LLMs can generalize to unseen fact knowledge, and that the scaling law for this generalization is similar to that of general pre-training. Additionally, we analyze the compatibility and preference of LLMs' fact memorization. For compatibility, we find that LLMs struggle to memorize redundant facts in a unified way: only when correlated facts have the same direction and structure can the LLM memorize them compatibly. This shows the inefficiency of LLM memorization for redundant facts. For preference, the LLM pays more attention to memorizing more frequent and more difficult facts, and subsequent facts can overwrite the memorization of prior facts, which significantly hinders the memorization of low-frequency facts. Our findings reveal the capacity and characteristics of LLMs' fact knowledge learning and provide directions for LLMs' fact knowledge augmentation.
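
The two functional forms named in the abstract can be sketched as follows. The parameterizations (`slope`, `intercept`, `c_max`, `rate`) are assumed for illustration and are not the paper's fitted equations. The only numbers taken from the source are the Wikidata figures: 15B triples, 1000B non-embed parameters, 100 epochs. The implied slope additionally assumes a zero intercept, which the source does not state.

```python
import math


def capacity_vs_size(n_params: float, slope: float, intercept: float = 0.0) -> float:
    """Linear law (assumed form): capacity = slope * n_params + intercept."""
    return slope * n_params + intercept


def capacity_vs_epochs(epochs: float, c_max: float, rate: float) -> float:
    """Negative exponential law (assumed form).

    Capacity rises toward the ceiling c_max as epochs grow:
    capacity = c_max * (1 - exp(-rate * epochs)).
    """
    return c_max * (1.0 - math.exp(-rate * epochs))


# Figures quoted in the source for the 100-epoch setting.
WIKIDATA_TRIPLES = 15e9          # 15B triples
REQUIRED_NON_EMBED_PARAMS = 1e12  # 1000B non-embed parameters

# Facts per non-embed parameter, ASSUMING the line passes through the origin.
implied_slope = WIKIDATA_TRIPLES / REQUIRED_NON_EMBED_PARAMS
print(f"implied facts per non-embed parameter: {implied_slope:.3f}")

# Illustrative saturation curve with hypothetical c_max and rate values.
for e in (1, 10, 100):
    print(e, capacity_vs_epochs(e, c_max=1.0, rate=0.05))
```

Under the zero-intercept assumption the quoted figures correspond to roughly 0.015 facts per non-embed parameter at 100 epochs, which gives a sense of why the abstract calls full memorization of public facts almost implausible.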

Paper Structure (46 sections, 5 equations, 15 figures, 11 tables)

Introduction
RQ1: How does LLM's fact knowledge capacity scale with its size and training epochs?
RQ2: Can LLMs efficiently memorize redundant facts?
RQ3: What influences LLM's memorization preference for different types of fact knowledge?
RQ4: Can LLMs generalize on unseen fact knowledge? What is the relation between fact memorization and generalization?
Preliminary
Atomic Fact Knowledge Memorization
Dataset
Implementation Details
Fact Capacity Scaling Laws
Exploratory Experiment
Scaling Law of Fact Capacity and Model Size
Scaling Law of Fact Capacity and Epochs
Experiments on Wikidata
Redundant Fact Memorization
...and 31 more sections

Figures (15)

Figure 1: The fact capacity of LLMs with different sizes on Wikidata, under 100 training epochs. According to the predicted scaling law, memorizing all Wikidata triples (15B) requires 1000B non-embed parameters.
Figure 2: LLMs' memorization rate under different numbers of training facts.
Figure 3: The relation between LLMs' fact capacity and their model sizes, under fixed training epochs.
Figure 4: The relation between LLMs' fact capacity and training epochs, under fixed model size.
Figure 5: LLMs' memorization of the same facts with different directions, where "*" means the facts are from another group of keys. The right panel shows the learning curves.
...and 10 more figures
