Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training

Yanlai Yang; Matt Jones; Michael C. Mozer; Mengye Ren


TL;DR

The paper demonstrates a new mechanism by which over-parameterized neural networks can recover from catastrophic interference, and it offers new insights into training such networks in cyclically structured environments.

Abstract

We explore the training dynamics of neural networks in a structured non-IID setting where documents are presented cyclically in a fixed, repeated sequence. Typically, networks suffer from catastrophic interference when training on a sequence of documents; however, we discover a curious and remarkable property of LLMs fine-tuned sequentially in this setting: they exhibit anticipatory behavior, recovering from forgetting of documents before encountering them again. This behavior occurs even though the documents are never presented in context together. The behavior emerges and becomes more robust as the architecture scales up its number of parameters. Through comprehensive experiments and visualizations, we demonstrate a new mechanism by which over-parameterized neural networks can recover from catastrophic interference and uncover new insights into training over-parameterized networks in cyclically structured environments.
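The cyclic setting described in the abstract can be illustrated with a minimal sketch of the two document orderings that Figure 1 compares: a fixed sequence repeated every epoch, and a random shuffled baseline. This is not the authors' code. The function names, the use of integer document indices, and the per-epoch reshuffling in the baseline are assumptions made for illustration, and no model training is shown.

```python
import random


def cyclic_schedule(num_docs, num_epochs):
    """Document indices in one fixed order, repeated every epoch."""
    return [doc for _ in range(num_epochs) for doc in range(num_docs)]


def shuffled_schedule(num_docs, num_epochs, seed=0):
    """Assumed baseline: a fresh random permutation of the documents each epoch."""
    rng = random.Random(seed)
    order = []
    for _ in range(num_epochs):
        epoch = list(range(num_docs))
        rng.shuffle(epoch)
        order.extend(epoch)
    return order


def steps_until_next_visit(schedule, step, focal_doc):
    """Steps after `step` until `focal_doc` is trained on again (None if never)."""
    for offset, doc in enumerate(schedule[step + 1:], start=1):
        if doc == focal_doc:
            return offset
    return None


if __name__ == "__main__":
    cyclic = cyclic_schedule(num_docs=3, num_epochs=2)
    print(cyclic)  # [0, 1, 2, 0, 1, 2]
    # In the cyclic order, the gap to the next visit of a document is fixed
    # by its position in the cycle.
    print(steps_until_next_visit(cyclic, step=0, focal_doc=0))  # 3
```

In the cyclic order, the number of steps until a document reappears is fully determined by the position in the cycle, whereas in the shuffled order it varies from epoch to epoch. That regularity is the structure the abstract refers to, and the sketch only makes the ordering concrete.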


Paper Structure (64 sections, 2 equations, 23 figures, 1 table)


Introduction
Data and Experiment Setup
Models.
Datasets.
Training Setup.
Emergent Anticipatory Recovery
The Anticipatory Recovery Phenomenon
Anticipatory Recovery is an Emergent Behavior
Anticipatory Recovery in Randomly Initialized Models.
Effects of Model Width and Depth.
Other Influential Factors
Number of Tasks.
Number of Gradient Steps.
Context Length.
Number of Frozen Blocks.
...and 49 more sections

Figures (23)

Figure 1: (a) Loss curves on document 1 for cyclic and random shuffled fine-tuning on a pre-trained Pythia-1B model. The black circles indicate points just prior to training on the focal document. The inverted-U loss curves within each epoch demonstrate the anticipatory recovery phenomenon. (b) Shift-averaged loss curve for cyclic fine-tuning. (c) Online loss curves for cyclic and random shuffled fine-tuning with prequential evaluation.
Figure 2: Effect of model size for (a) pre-trained models and (b) random initializations. In each subfigure, the left shows shift-averaged loss curves and the right shows the recovery score as a function of model size.
Figure 3: Models trained from scratch with (a) different width (token embedding size) and (b) different depth (number of transformer blocks).
Figure 4: Effect of data randomization strength. (a) Random masking with probability up to $0.3$; (b) Random shift of context window up to $128$ tokens.
Figure 5: Effects of (a) number of documents (b) number of gradient steps (c) context length and (d) number of frozen blocks.
...and 18 more figures
