Local vs Global continual learning

Giulia Lanzillotta; Sidak Pal Singh; Benjamin F. Grewe; Thomas Hofmann

TL;DR

This work reframes continual learning as an approximation problem for the multi-task loss $\mathcal{L}^{MT}_t(\bm{\theta})$, comparing local versus global loss approximations and formalizing the locality assumption. It derives optimal objectives for quadratic local approximations and proves that Orthogonal Gradient Descent (OGD) implements this local optimum, linking established methods to the theory. Through extensive experiments on split CIFAR-10/100, Tiny ImageNet, and Rotated-MNIST, the authors show local approaches excel when locality holds but degrade as parameter updates move far from past task solutions, whereas global approaches show more consistent forgetting behavior across update sizes. The results offer principled guidance for selecting continual learning strategies under compute/memory constraints and highlight the loss-approximation lens as a unifying framework for understanding forgetting dynamics.
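
The projection step at the heart of Orthogonal Gradient Descent can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the helper names `orthonormalize` and `project_orthogonal` and the toy vectors are hypothetical, and a real OGD setup would store gradients of the network outputs on past-task samples rather than hand-written vectors.

```python
import numpy as np

def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis (as rows) for span(vectors)."""
    basis = []
    for v in vectors:
        w = np.asarray(v, dtype=float).copy()
        for b in basis:
            w -= np.dot(w, b) * b          # strip the component along b
        norm = np.linalg.norm(w)
        if norm > tol:                     # skip linearly dependent vectors
            basis.append(w / norm)
    return np.array(basis)

def project_orthogonal(grad, basis):
    """Remove from grad its components along every stored direction."""
    g = np.asarray(grad, dtype=float).copy()
    for b in basis:
        g -= np.dot(g, b) * b
    return g

# Hypothetical stored gradient directions from earlier tasks (toy values).
past_grads = [np.array([1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0])]
basis = orthonormalize(past_grads)

# Hypothetical gradient of the new task's loss (toy values).
new_grad = np.array([0.5, -2.0, 3.0])
update_dir = project_orthogonal(new_grad, basis)

# The update direction has no component along any stored direction.
print(update_dir)
```

Taking a step along `update_dir` instead of `new_grad` leaves the stored directions untouched to first order, which is why the guarantee is only as good as the locality assumption: it relies on the stored gradients still describing the past-task losses at the current parameters.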

Abstract

Continual learning is the problem of integrating new information in a model while retaining the knowledge acquired in the past. Despite the tangible improvements achieved in recent years, the problem of continual learning is still an open one. A better understanding of the mechanisms behind the successes and failures of existing continual learning algorithms can unlock the development of new successful strategies. In this work, we view continual learning from the perspective of the multi-task loss approximation, and we compare two alternative strategies, namely local and global approximations. We classify existing continual learning algorithms based on the approximation used, and we assess the practical effects of this distinction in common continual learning settings. Additionally, we study optimal continual learning objectives in the case of local polynomial approximations and we provide examples of existing algorithms implementing the optimal objectives.

Paper Structure (36 sections, 2 theorems, 41 equations, 6 figures, 3 tables)

Introduction
Background
Local and Global approximations in continual learning
Problem formulation
Local and global approximations
Case study: local polynomial approximations
Quadratic local approximations
Local and Global algorithms in the literature
Global algorithms
Local algorithms
Experiments
Experimental setup.
Local VS Global
Main experiment.
iCarl.
...and 21 more sections

Key Result

Theorem 4.1

For any continual learning algorithm producing a sequence of parameters $\bm\theta_1, \dots, \bm\theta_t$ such that $\bm\theta_i$ is a local minimum of $L_i$ and $\sup_{\bm\theta_i, \bm\theta_k} \|\bm\theta_i - \bm\theta_k\|^3 < \epsilon$, the following relationship holds: Moreover, if ${E}(1), \dots, {E}(t-1)=0$ the optimal learning objective for task $t$ is:
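
As background for the cubic distance condition, the standard second-order Taylor expansion of a task loss around its own solution is sketched below. It assumes $L_i$ is three times continuously differentiable near $\bm\theta_i$; the symbol $\mathbf{H}_i$ (the Hessian of $L_i$ at $\bm\theta_i$) is introduced here for illustration and is not taken from the statement above.

```latex
\begin{align*}
L_i(\bm\theta)
  &= L_i(\bm\theta_i)
   + \nabla L_i(\bm\theta_i)^\top (\bm\theta - \bm\theta_i)
   + \tfrac{1}{2} (\bm\theta - \bm\theta_i)^\top \mathbf{H}_i (\bm\theta - \bm\theta_i)
   + \mathcal{O}\!\left(\|\bm\theta - \bm\theta_i\|^3\right) \\
  &= L_i(\bm\theta_i)
   + \tfrac{1}{2} (\bm\theta - \bm\theta_i)^\top \mathbf{H}_i (\bm\theta - \bm\theta_i)
   + \mathcal{O}\!\left(\|\bm\theta - \bm\theta_i\|^3\right),
   \qquad \text{since } \nabla L_i(\bm\theta_i) = \mathbf{0}.
\end{align*}
```

The linear term drops out because $\bm\theta_i$ is a local minimum, so a quadratic local approximation is accurate up to a remainder that scales with the cube of the distance travelled from $\bm\theta_i$.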

Figures (6)

Figure 1: Distance travelled in the parameter space as a function of the optimizer learning rate and the task. We use a color coding of the tasks (a brighter color corresponds to a later task). For each task, we measure the Euclidean distance between $\bm\theta_t$ and the initialization $\bm\theta_0$. We plot results over all algorithms and random seeds (for a total of $5$). Finally, the yellow dashed line is provided as a reference of the relative scale of the $y$-axes across datasets.
Figure 2: Comparison of random sampling (ours) and herding (standard) buffer selection strategies in iCarl. Higher learning rates, associated with non-local learning, are shown in shades of blue, while learning rates associated with local learning are shown in shades of red. Each experiment is repeated over $5$ random seeds (plotted as different points).
Figure 3: In orange, the perturbation score $\mathfrak{s}(r)$ and in gray, the task loss, evaluated on train and test data on the Split CIFAR-10 tasks. The shaded area around the curves reflects standard deviation across tasks. Different lines correspond to different perturbation directions (the first $10$ eigenvectors of the corresponding loss). We evaluate the curves for multiple values of $r$ on a logarithmic scale in the range $[10^{-3}, 10^{6}]$. The shape of the curve is remarkably stable across tasks. Also, notice that the score $\mathfrak{s}(r)$ on the test data is large even for $r=0$, which indicates that the test loss is not $0$ at the local optima.
Figure 4: Rank and effective rank for various threshold $\lambda$ values on a tiny Rotated MNIST challenge. The values are averaged over $5$ seeds.
Figure 5: In orange, the perturbation score $\mathfrak{s}(r)$ and in gray, the task loss, evaluated on train and test data on the Rotated-MNIST 20 tasks. The shaded area around the curves reflects standard deviation across tasks. Different lines correspond to different perturbation directions (the first $10$ eigenvectors of the corresponding loss). We evaluate the curves for multiple values of $r$ on a logarithmic scale in the range $[10^{-3}, 10^{6}]$. The shape of the curve is remarkably stable across tasks.
...and 1 more figure

Theorems & Definitions (3)

Definition 3.1: Local and global task loss approximation.
Theorem 4.1: Optimal quadratic local continual learning
Theorem 5.1
