Two Complementary Perspectives to Continual Learning: Ask Not Only What to Optimize, But Also How

Timm Hess; Tinne Tuytelaars; Gido M. van de Ven

TL;DR

We propose that continual learning strategies should focus not only on the optimization objective but also on how this objective is optimized, opening up a new direction for continual learning research.

Abstract

Recent years have seen considerable progress in the continual training of deep neural networks, predominantly thanks to approaches that add replay or regularization terms to the loss function to approximate the joint loss over all tasks so far. However, we show that even with a perfect approximation to the joint loss, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task. Motivated by this 'stability gap', we propose that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While some continual learning work alters the optimization trajectory (e.g., using gradient projection techniques), this line of research is positioned as an alternative to improving the optimization objective, whereas we argue it should be complementary. In search of empirical support for our proposition, we perform a series of pre-registered experiments combining replay-approximated joint objectives with gradient projection-based optimization routines. However, this first experimental attempt fails to show clear and consistent benefits. Nevertheless, our conceptual arguments, as well as some of our empirical results, demonstrate the distinctive importance of the optimization trajectory in continual learning, thereby opening up a new direction for continual learning research.
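To make the idea of pairing a replay-approximated joint objective with a gradient projection-based optimization routine more concrete, the following is a minimal sketch of one such projection step. It uses the single-constraint rule known from A-GEM; the function name `agem_project`, the flattened NumPy gradient vectors, and the returned flag are illustrative assumptions, not the authors' implementation. Here `grad` stands for the gradient of the joint-loss proxy (current-task data plus replayed data) and `ref_grad` for a reference gradient computed on stored data from old tasks.

```python
import numpy as np


def agem_project(grad, ref_grad):
    """Apply an A-GEM-style projection to a flattened gradient (illustrative sketch).

    grad:     gradient of the joint-loss proxy (assumed: current + replayed data)
    ref_grad: reference gradient computed on stored old-task data (assumed)

    Returns the (possibly projected) gradient and a flag telling whether
    a projection took place at this step.
    """
    grad = np.asarray(grad, dtype=float)
    ref_grad = np.asarray(ref_grad, dtype=float)

    dot = float(grad @ ref_grad)
    if dot >= 0.0:
        # No conflict with the old-task direction: keep the gradient as is.
        # (This branch also covers an all-zero reference gradient.)
        return grad, False

    # Conflict: remove the component of grad that points against ref_grad.
    ref_sq = float(ref_grad @ ref_grad)  # strictly positive whenever dot < 0
    projected = grad - (dot / ref_sq) * ref_grad
    return projected, True


# Usage: a proposed step that would increase the loss on the old tasks.
g, was_projected = agem_project([1.0, -2.0], [0.0, 1.0])
print(g, was_projected)
```

In this sketch the objective (what is optimized) is left untouched; only the update direction (how it is optimized) changes. Averaging the returned flag over runs at each iteration would give a quantity like the proportion of projected gradients reported in the bottom panels of Figures 3 to 5.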

Paper Structure (39 sections, 17 equations, 14 figures, 4 tables, 3 algorithms)

INTRODUCTION
TWO PERSPECTIVES TO CONTINUAL LEARNING
The Standard Approach to Continual Learning: Improving the Loss Function
The Stability Gap: A Challenge for the Standard Approach to Continual Learning
Proposed Complementary Approach: Improving the Optimization Trajectory
How to Improve Optimization for Continual Learning?
Another Way to Avoid the Stability Gap?
GRADIENT PROJECTION-BASED OPTIMIZATION
Orthogonal Gradient Projection
Gradient Episodic Memories
PROOF-OF-CONCEPT EXPERIMENTS
Experience Replay with Gradient Projection-based Optimization
Approximating the Joint Loss
Optimization Trajectory
Approaches to Compare
...and 24 more sections

Figures (14)

Figure 1: The stability gap occurs even with incremental joint training (or 'full replay'). Shown is the test accuracy on the first task while the network is incrementally trained on all five tasks of Domain CIFAR. During the $n$-th task, the network is trained jointly on all training data from the first $n$ tasks. Even with this ideal approximation to $\ell_{\text{joint}}$, performance severely drops upon encountering a new task. Displayed are the means over five repetitions, shaded areas are $\,\pm\,1$ standard error of the mean. Vertical dashed lines indicate task switches.
Figure 2: Schematic of the stability gap, and how adjusting the optimization trajectory could avoid it. When, starting from a solution for the old tasks ($\widehat{w}_{\text{old}}$), a proxy of the joint loss ($\widetilde{\ell}_{\text{joint}}$) is optimized with standard stochastic gradient descent, the optimization trajectory first passes through a region in parameter space with high loss on the old tasks before converging to a solution that is good for all tasks ($\widehat{w}_{\text{joint}}$). Work on mode connectivity suggests that a low-loss path between $\widehat{w}_{\text{old}}$ and $\widehat{w}_{\text{joint}}$ exists as well (dashed arrow), indicating that it should be possible to overcome the stability gap with a different optimization routine. Green shading indicates areas of low loss on the old tasks.
Figure 3: Stability gaps for the first task of offline Rotated MNIST. The left side shows standard ER, the right side incremental joint training (or 'full replay') -- both by themselves and in combination with the optimization mechanism of GEM and A-GEM. The middle panels show the test accuracy on the first task while the model is incrementally trained for all tasks of the benchmark. The top panels show zoomed-in views of the first 50 training iterations after a task switch, allowing a more detailed qualitative comparison of the stability gap. These plots show the mean $\,\pm\,$ standard error (shaded area) over five runs with different random seeds. The bottom panel shows for every iteration the proportion of runs where the gradient was projected, with $0$ indicating that at this iteration there was no run in which a gradient was projected and $1$ indicating that there was a gradient projection in every run.
Figure 4: Stability gaps for the first task of offline Domain CIFAR-100. The left side shows standard ER, the right side incremental joint training (or 'full replay') -- both by themselves and in combination with the optimization mechanism of GEM and A-GEM. The middle panels show the test accuracy on the first task while the model is incrementally trained for all tasks of the benchmark. The top panels show zoomed-in views of the first 50 training iterations after a task switch, allowing a more detailed qualitative comparison of the stability gap. These plots show the mean $\,\pm\,$ standard error (shaded area) over five runs with different random seeds. The bottom panel shows for every iteration the proportion of runs where the gradient was projected, with $0$ indicating that at this iteration there was no run in which a gradient was projected and $1$ indicating that there was a gradient projection in every run.
Figure 5: Stability gaps for the first task of offline Split CIFAR-100. The left side shows standard ER, the right side incremental joint training (or 'full replay') -- both by themselves and in combination with the optimization mechanism of GEM and A-GEM. The middle panels show the test accuracy on the first task while the model is incrementally trained for all tasks of the benchmark. The top panels show zoomed-in views of the first 50 training iterations after a task switch, allowing a more detailed qualitative comparison of the stability gap. These plots show the mean $\,\pm\,$ standard error (shaded area) over five runs with different random seeds. The bottom panel shows for every iteration the proportion of runs where the gradient was projected, with $0$ indicating that at this iteration there was no run in which a gradient was projected and $1$ indicating that there was a gradient projection in every run.
...and 9 more figures
