Knowledge Accumulation in Continually Learned Representations and the Issue of Feature Forgetting

Timm Hess, Eli Verwimp, Gido M. van de Ven, Tinne Tuytelaars

TL;DR

Feature forgetting is shown to be problematic because it substantially slows down the incremental learning of good general representations (i.e. knowledge accumulation). The paper also studies how feature forgetting and knowledge accumulation are affected by different types of continual learning methods.

Abstract

Continual learning research has shown that neural networks suffer from catastrophic forgetting "at the output level", but it is debated whether this is also the case at the level of learned representations. Multiple recent studies ascribe representations a certain level of innate robustness against forgetting -- that they only forget minimally in comparison with forgetting at the output level. We revisit and expand upon the experiments that revealed this difference in forgetting and illustrate the coexistence of two phenomena that affect the quality of continually learned representations: knowledge accumulation and feature forgetting. Taking both aspects into account, we show that, even though forgetting in the representation (i.e. feature forgetting) can be small in absolute terms, when measuring relative to how much was learned during a task, forgetting in the representation tends to be just as catastrophic as forgetting at the output level. Next we show that this feature forgetting is problematic as it substantially slows down the incremental learning of good general representations (i.e. knowledge accumulation). Finally, we study how feature forgetting and knowledge accumulation are affected by different types of continual learning methods.
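
To make the distinction between absolute and relative forgetting concrete, the following is a minimal sketch. It assumes that relative forgetting is the accuracy lost on a task divided by the accuracy gained while training on that task. The paper's own definition is given by an equation that is not reproduced here, so the formula, the function names and all numbers below are illustrative assumptions, not the authors' implementation or results.

```python
def absolute_forgetting(acc_after_task, acc_later):
    """Accuracy lost on a task between the end of its training and a later point."""
    return acc_after_task - acc_later


def relative_forgetting(acc_before_task, acc_after_task, acc_later):
    """Fraction of the accuracy gained while training on a task that is later lost.

    ASSUMPTION: this ratio is one plausible reading of "forgetting relative to
    how much was learned during a task"; it is not copied from the paper.
    """
    gained = acc_after_task - acc_before_task
    if gained <= 0:
        raise ValueError("no accuracy was gained on the task; ratio is undefined")
    return (acc_after_task - acc_later) / gained


# Hypothetical accuracies (not results from the paper).
# Output level: the task is unsolved before training and largely lost afterwards.
output = {"acc_before_task": 0.00, "acc_after_task": 0.80, "acc_later": 0.10}
# Representation level: earlier tasks already give a decent linear-probe accuracy.
probe = {"acc_before_task": 0.50, "acc_after_task": 0.60, "acc_later": 0.5125}

abs_output = absolute_forgetting(output["acc_after_task"], output["acc_later"])
abs_probe = absolute_forgetting(probe["acc_after_task"], probe["acc_later"])
rel_output = relative_forgetting(**output)
rel_probe = relative_forgetting(**probe)

print(f"absolute forgetting: output={abs_output:.3f}, representation={abs_probe:.3f}")
print(f"relative forgetting: output={rel_output:.3f}, representation={rel_probe:.3f}")
```

With these made-up numbers the absolute drop is far larger at the output level (about 0.70 versus about 0.09), yet in both cases roughly 87.5% of what was gained during the task is lost. This is the kind of discrepancy between the two views that the abstract describes.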

Paper Structure (28 sections, 2 equations, 25 figures, 5 tables)

Figures (25)

  • Figure 1: Illustration of feature forgetting and knowledge accumulation, for a model continually finetuned on Split MiniImageNet. On the left, the output accuracy and linear probe (LP) accuracy on $T_6$ are shown. The difference between the LP accuracy directly after the task was trained and the LP accuracy after training on later tasks is what we call feature forgetting. On the right, the same continually trained model is evaluated with the same metrics but on a downstream task ($T_d$). We refer to the improvement in LP accuracy over that of a randomly initialized model ($f_0$) as knowledge accumulation. (Mean $\pm$ SE, 5 runs)
  • Figure 2: Performance, absolute forgetting and relative forgetting of $T_6$ for a model continually finetuned on Split MiniImageNet. Comparing output accuracy and representation quality (LP accuracy) can lead to different interpretations depending on whether forgetting is considered in absolute or relative terms. Absolute forgetting (middle) suggests that forgetting is worse at the output level than at the representation level. Such absolute forgetting does not account for performance already accumulated before learning $T_6$ (left). When expressing forgetting relative to newly gained performance (right), forgetting trends are similar at the output and representation level. (Mean $\pm$ SE, 5 runs)
  • Figure 3: Relative forgetting at the output level and at the level of representations, as per Eq. \ref{eq:forgetting}, for a model continually fine-tuned on the Split MiniImageNet sequence. Relative to the knowledge gained during training on the task, forgetting in the representation and at the output are similar, except for the first few tasks. (Mean $\pm$ SE, 5 runs)
  • Figure 4: Illustration of the task exclusion difference (EXC, Eq. \ref{eq:exclusion}) for continual finetuning on Split MiniImageNet. Panel (a) depicts the adjusted training sequence for the "exclusion model" $f^{-j}$, and a comparison of the resulting LP accuracy for continually finetuned model $f$ and $f^{-6}$ on task $T_6$. Panel (b) shows that $\text{EXC}_{i,j}$, for the first 9 tasks, always follows the same trend. It averages around zero for tasks $i < j$, peaks for $i=j$, and quickly reduces to almost zero again for $i>j$. (Mean $\pm$ SE, 5 runs)
  • Figure 5: Schematic and result of the ensemble baseline. Panel (a) shows that training the ensemble baseline is the same as continual finetuning, but after every training phase a copy of the model is stored. During evaluation, each stored copy produces a separate representation. These representations are then concatenated and a linear classifier is trained on top of the concatenation. Panel (b) compares the LP performance of the ensemble and of finetuning on a downstream task. (Mean $\pm$ SE, 5 runs)
  • ...and 20 more figures
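
The evaluation of the ensemble baseline described in the Figure 5 caption (one stored copy per training phase, representations concatenated, a linear classifier on top) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' code: the stored model copies are replaced by hypothetical fixed random ReLU layers, the data is random, and the linear classifier is fitted by ridge-regularized least squares on one-hot labels rather than by whatever probe training procedure the paper uses.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_extractor(in_dim, feat_dim):
    """Hypothetical stand-in for one frozen, stored model copy."""
    weights = rng.normal(size=(in_dim, feat_dim))
    return lambda x: np.maximum(x @ weights, 0.0)  # fixed random ReLU features


def ensemble_features(extractors, x):
    """Concatenate the representations produced by every stored copy."""
    return np.concatenate([extract(x) for extract in extractors], axis=1)


def add_bias(feats):
    """Append a constant column so the linear classifier has a bias term."""
    return np.hstack([feats, np.ones((feats.shape[0], 1))])


def fit_linear_probe(feats, labels, num_classes, ridge=1e-3):
    """Least-squares linear classifier on frozen features (an assumed simplification)."""
    x = add_bias(feats)
    y = np.eye(num_classes)[labels]  # one-hot targets
    return np.linalg.solve(x.T @ x + ridge * np.eye(x.shape[1]), x.T @ y)


def probe_predict(probe_weights, feats):
    """Predict the class with the highest linear score."""
    return (add_bias(feats) @ probe_weights).argmax(axis=1)


# Toy downstream data: 200 samples, 16 input dimensions, 4 classes.
x_train = rng.normal(size=(200, 16))
y_train = rng.integers(0, 4, size=200)

# Three stored copies, e.g. one saved after each of three training phases.
copies = [make_extractor(in_dim=16, feat_dim=8) for _ in range(3)]

feats = ensemble_features(copies, x_train)  # 3 copies x 8 features each
probe_weights = fit_linear_probe(feats, y_train, num_classes=4)
preds = probe_predict(probe_weights, feats)
print(feats.shape, probe_weights.shape, preds.shape)
```

The point of the sketch is only the data flow: the size of the concatenated representation grows linearly with the number of stored copies, and none of the copies is updated while the classifier is fitted.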