Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition

Jiyeon Kim; Hyunji Lee; Hyowon Cho; Joel Jang; Hyeonbin Hwang; Seungpil Won; Youbin Ahn; Dohaeng Lee; Minjoon Seo

TL;DR

This work introduces knowledge entropy, a metric that quantifies how broadly a language model engages the parametric memory stored in its feed-forward network (FFN) memory vectors over the course of pretraining. Modeling the FFN as $\text{FFN}(\boldsymbol{x}) = f(\boldsymbol{x} K^{\top}) V$ and computing the entropy $\mathcal{H}(\theta)$ of the layer-wise memory coefficients, the authors show a consistent decline in knowledge entropy as pretraining progresses, which correlates with reduced knowledge acquisition and increased forgetting in continual learning. They validate this through experiments on OLMo 1B/7B with datasets such as PubMed, C4, and a Fictional Knowledge suite, and observe that resuscitating inactive memory vectors partly restores acquisition and retention. The study suggests that mid-stage pretraining checkpoints offer a practical balance between representation richness and plasticity, and that increasing memory-vector activity can mitigate some of the losses in continual knowledge integration, pointing to avenues for improving pretraining strategies and continual learning in large language models.

Abstract

In this work, we investigate how a model's tendency to broadly integrate its parametric knowledge evolves throughout pretraining, and how this behavior affects overall performance, particularly in terms of knowledge acquisition and forgetting. We introduce the concept of knowledge entropy, which quantifies the range of memory sources the model engages with; high knowledge entropy indicates that the model utilizes a wide range of memory sources, while low knowledge entropy suggests reliance on specific sources with greater certainty. Our analysis reveals a consistent decline in knowledge entropy as pretraining advances. We also find that the decline is closely associated with a reduction in the model's ability to acquire and retain knowledge, leading us to conclude that diminishing knowledge entropy (smaller number of active memory sources) impairs the model's knowledge acquisition and retention capabilities. We find further support for this by demonstrating that increasing the activity of inactive memory sources enhances the model's capacity for knowledge acquisition and retention.
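
The notion of knowledge entropy described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes that the memory coefficients of one token are the absolute activations $|f(\boldsymbol{x} K^{\top})|$ normalized to sum to one, that coefficients are averaged over tokens, and that per-layer entropies are summed; the SiLU choice for $f$ and all function names are hypothetical.

```python
import math

def silu(z):
    # SiLU written with tanh so large negative inputs do not overflow.
    # Assumption: stands in for the activation f in FFN(x) = f(x K^T) V.
    return z * 0.5 * (1.0 + math.tanh(z / 2.0))

def memory_coefficients(x, K, f=silu):
    """Coefficients of one token over the rows of K, normalized to sum to 1."""
    raw = [abs(f(sum(xi * ki for xi, ki in zip(x, k)))) for k in K]
    total = sum(raw)
    if total == 0.0:
        # Degenerate case (all activations zero): fall back to uniform.
        return [1.0 / len(K)] * len(K)
    return [r / total for r in raw]

def average_coefficients(tokens, K, f=silu):
    """Average the per-token coefficient distributions over all tokens."""
    avg = [0.0] * len(K)
    for x in tokens:
        for i, c in enumerate(memory_coefficients(x, K, f)):
            avg[i] += c / len(tokens)
    return avg

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def knowledge_entropy(tokens_per_layer, keys_per_layer, f=silu):
    """Sum of layer-wise entropies (assumed aggregation across layers)."""
    return sum(
        entropy(average_coefficients(tokens, K, f))
        for tokens, K in zip(tokens_per_layer, keys_per_layer)
    )

# Toy example: one layer, two tokens, three memory vectors.
tokens = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(knowledge_entropy([tokens], [K]))
```

Under these assumptions, a layer whose averaged coefficients are uniform over $m$ memory vectors attains the maximum entropy $\log m$, while a layer that relies on a single vector has entropy 0, matching the high versus low knowledge entropy contrast drawn in the abstract.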

Paper Structure (50 sections, 7 equations, 26 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Dynamics of Knowledge in Language Models
Entropy in Natural Language Processing
Knowledge Entropy
Definition
Experiment Setup
Final Models Tend to Exhibit Lower Knowledge Entropy
Similar Trends are Observed by Different Definitions of Entropy
Entropy of Attention Layers
Entropy of Next Token Prediction
Knowledge Acquisition and Forgetting
Experiment Setup
Model & Hyperparameters
Dataset
...and 35 more sections

Figures (26)

Figure 1: Illustration of our findings: the distribution of memory coefficients $\bar{C}$ in feed-forward layers becomes sparser throughout pretraining, as indicated by a decrease in knowledge entropy $\mathcal{H}(\theta)$. This sparsity reduces the model's knowledge acquisition $\mathcal{A}(\theta)$ and increases forgetting $\mathcal{F}(\theta)$ when continual knowledge learning is conducted with models from different pretraining stages. Accordingly, as denoted by the star, when we artificially increase the knowledge entropy of the final-stage model, both knowledge acquisition and retention increase.
Figure 2: Entropy (y-axis) across different model states (x-axis) for OLMo 1B and 7B. The x-axis shows the current step as a fraction of the last step (738k for 1B and 557k for 7B).
Figure 3: Entropy (y-axis) defined with attention weights and next-token prediction probabilities across different model states (x-axis) for OLMo 1B.
Figure 4:
Figure 5:
...and 21 more figures
