Retentive or Forgetful? Diving into the Knowledge Memorizing Mechanism of Language Models

Boxi Cao; Qiaoyu Tang; Hongyu Lin; Shanshan Jiang; Bin Dong; Xianpei Han; Jiawei Chen; Tianshu Wang; Le Sun

TL;DR

This work probes the memory behavior of language models by contrasting vanilla (unpretrained) models with pre-trained models using a factual knowledge acquisition setup derived from LAMA. It introduces a testbed that tracks forgetting curves across knowledge types, learning cycles, and time scales, revealing a clear transition from short-term to long-term memory driven by pre-training. The findings show that vanilla models suffer catastrophic forgetting and have limited memory, while pre-trained models develop durable memorization that strengthens with longer pre-training. Memory formation is also modulated by knowledge relevance and diversification: interference from related facts and low diversity both cause memory instability. These results offer practical guidance for designing learning regimens and evaluation protocols.

Abstract

Memory is one of the most essential cognitive functions, serving as a repository of world knowledge and episodes of activities. In recent years, large-scale pre-trained language models have shown remarkable memorizing ability. In contrast, vanilla neural networks without pre-training have long been observed to suffer from the catastrophic forgetting problem. To investigate this retentive-forgetful contradiction and understand the memory mechanism of language models, we conduct thorough experiments by controlling the target knowledge types, the learning strategies, and the learning schedules. We find that: 1) Vanilla language models are forgetful; 2) Pre-training leads to retentive language models; 3) Knowledge relevance and diversification significantly influence memory formation. These conclusions are useful for understanding the abilities of pre-trained language models and shed light on designing and evaluating new learning and inference algorithms for language models.

Paper Structure (19 sections, 11 figures, 1 table)

Introduction
Factual Knowledge Acquisition Testbed
Dataset.
Language Model Architectures.
Knowledge Acquisition and Forgetting Curves.
Definitions.
Vanilla Language Models are Forgetful
Limited Memory Duration
Limited Memory Capacity
Pre-training Leads to Retentive Language Models
Pre-trained Language Models are Retentive
Pre-training Leads to Long-term Memorizing
Knowledge Relevance and Diversification Affects Memory Formation
Correlations brings Competitions
Memorizing Singularity and its Causes
...and 4 more sections

Figures (11)

Figure 1: The illustrated forgetting curve of vanilla language models and pre-trained language models.
Figure 2: The illustrated experiment process.
Figure 3: The forgetting curve of relation native language and manufacture.
Figure 4: The correlations between forgetting curve and correlation curve.
Figure 5: The forgetting curve of relation native language and manufacture.
...and 6 more figures
