Gradient Localization Improves Lifelong Pretraining of Language Models

Jared Fernandez, Yonatan Bisk, Emma Strubell

TL;DR

Knowledge about updated and newly mentioned entities is localized to a subset of a language model's layers, and targeting parameter updates to those layers improves continual pretraining on language containing temporal drift. The authors hypothesize that existing continual learning methods, by ignoring this locality of knowledge, suffer both failed uptake of new information and catastrophic forgetting of previously learned information.

Abstract

Large Language Models (LLMs) trained on web-scale text corpora have been shown to capture world knowledge in their parameters. However, the mechanism by which language models store different types of knowledge is poorly understood. In this work, we examine two types of knowledge relating to temporally sensitive entities and demonstrate that each type is localized to different sets of parameters within the LLMs. We hypothesize that the lack of consideration of the locality of knowledge in existing continual learning methods contributes to both the failed uptake of new information and the catastrophic forgetting of previously learned information. We observe that sequences containing references to updated and newly mentioned entities exhibit larger gradient norms in a subset of layers. We demonstrate that targeting parameter updates to these relevant layers can improve the performance of continually pretraining on language containing temporal drift.
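
The idea described in the abstract (measure per-layer gradient norms, then restrict updates to the layers where they are largest) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `layer_grad_norms`, `select_layers`, and `masked_sgd_step`, the top-k selection rule, the plain SGD update, and the toy values are all assumptions made for illustration, and the paper's actual procedure for targeting updates may differ.

```python
import math


def layer_grad_norms(grads):
    """Return the L2 norm of each layer's gradient.

    `grads` maps a (hypothetical) layer name to a flat list of gradient values.
    """
    return {name: math.sqrt(sum(g * g for g in values))
            for name, values in grads.items()}


def select_layers(norms, k):
    """Pick the k layers with the largest gradient norm (assumed selection rule)."""
    return set(sorted(norms, key=norms.get, reverse=True)[:k])


def masked_sgd_step(params, grads, selected, lr):
    """Apply a plain SGD step to the selected layers only; other layers stay frozen."""
    return {
        name: ([p - lr * g for p, g in zip(values, grads[name])]
               if name in selected else list(values))
        for name, values in params.items()
    }


# Toy example: three "layers" with two parameters each (illustrative values only).
params = {"layer0": [1.0, 1.0], "layer1": [1.0, 1.0], "layer2": [1.0, 1.0]}
grads = {"layer0": [0.1, 0.0], "layer1": [3.0, 4.0], "layer2": [0.0, 0.2]}

norms = layer_grad_norms(grads)
selected = select_layers(norms, k=1)
updated = masked_sgd_step(params, grads, selected, lr=0.1)
```

In this toy setting only `layer1`, which has the largest gradient norm, is updated, while the remaining layers keep their original values. In a real model the gradients would come from backpropagation on sequences mentioning updated or new entities, and the selection could be made separately for attention and MLP sublayers.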

Paper Structure

This paper contains 18 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: When continually pretraining on sequences with updated and newly mentioned entities, certain layers consistently observe larger gradient norms.
  • Figure 2: Relative gradient norms for the salient spans in ECBD and TempLAMA for the GPT-2 Base (110M; left) and GPT-2 Large (770M; right) models. Norms for attention (top) and MLP (bottom) layers are shown separately. Gradient norms of salient spans are 4 to 15x larger than those of the full sequence.
  • Figure 3: Relative gradient norms for the GPT-Neo 1.3B parameter model.