The Joint Effect of Task Similarity and Overparameterization on Catastrophic Forgetting -- An Analytical Model

Daniel Goldfarb, Itay Evron, Nir Weinberger, Daniel Soudry, Paul Hand

TL;DR

This work tackles catastrophic forgetting in continual learning by analyzing a two-task linear regression model where the second task is a random orthogonal transform of the first, parameterized by the DOTS-derived similarity $\alpha = m/p$ and overparameterization $\beta = 1 - d/p$. The authors derive an exact non-asymptotic expression for the worst-case forgetting under sequential SGD, revealing a non-monotonic dependence on similarity in highly overparameterized regimes and a monotonic decrease in forgetting with task similarity near the interpolation threshold. They provide a detailed proof sketch using Haar-orthogonal integrals, and validate the theory with synthetic linear regression and neural-network experiments on permutation benchmarks, including permuted MNIST. The results demonstrate that overparameterization does not always prevent forgetting and that the joint interaction with task similarity can produce a peak in forgetting at intermediate similarity when the model is highly overparameterized. The work offers a principled framework (with explicit $\alpha$ and $\beta$ dependencies) that informs the design of continual-learning systems and bridges theory with neural-network experiments, while outlining clear paths for extending to nonlinear models and more tasks.

Abstract

In continual learning, catastrophic forgetting is affected by multiple aspects of the tasks. Previous works have analyzed separately how forgetting is affected by either task similarity or overparameterization. In contrast, our paper examines how task similarity and overparameterization jointly affect forgetting in an analyzable model. Specifically, we focus on two-task continual linear regression, where the second task is a random orthogonal transformation of an arbitrary first task (an abstraction of random permutation tasks). We derive an exact analytical expression for the expected forgetting - and uncover a nuanced pattern. In highly overparameterized models, intermediate task similarity causes the most forgetting. However, near the interpolation threshold, forgetting decreases monotonically with the expected task similarity. We validate our findings with linear regression on synthetic data, and with neural networks on established permutation task benchmarks.
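
The two-task setting described in the abstract can be sketched numerically. The following is a minimal illustration and not the authors' code. It assumes that the second task reuses the labels $\mathbf{y}$ of the first task with features $\mathbf{X}\mathbf{O}$, that the random transform has the form $\mathbf{O} = \mathbf{Q}_p\,\mathrm{diag}(\mathbf{Q}_m, \mathbf{I}_{p-m})\,\mathbf{Q}_p^{\top}$ with Haar-distributed $\mathbf{Q}_p, \mathbf{Q}_m$, and that each task is fit to convergence by moving to the nearest interpolating solution. The function names (`haar_orthogonal`, `random_task_transform`, `min_norm_update`, `forgetting`) are hypothetical.

```python
import numpy as np


def haar_orthogonal(k, rng):
    """Sample a k x k orthogonal matrix from the Haar measure via QR."""
    a = rng.standard_normal((k, k))
    q, r = np.linalg.qr(a)
    # Fixing the column signs makes the QR-based sample Haar-distributed.
    return q * np.sign(np.diag(r))


def random_task_transform(p, m, rng):
    """Assumed form: a random rotation acting on a random m-dim subspace."""
    q_p = haar_orthogonal(p, rng)
    block = np.eye(p)
    block[:m, :m] = haar_orthogonal(m, rng)
    return q_p @ block @ q_p.T


def min_norm_update(w_prev, X, y):
    """Closest point to w_prev (in Euclidean norm) that interpolates (X, y)."""
    return w_prev + np.linalg.pinv(X) @ (y - X @ w_prev)


def forgetting(X, y, O):
    """Return (task-1 loss after task 1, task-1 loss after task 2)."""
    p = X.shape[1]
    w1 = min_norm_update(np.zeros(p), X, y)   # fit task 1 from zero init
    w2 = min_norm_update(w1, X @ O, y)        # then fit task 2 from w1
    loss_before = np.mean((X @ w1 - y) ** 2)
    loss_after = np.mean((X @ w2 - y) ** 2)
    return loss_before, loss_after


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, d = 50, 10                      # beta = 1 - d/p
    X = rng.standard_normal((d, p))    # rank d with probability one
    y = X @ rng.standard_normal(p)     # realizable labels
    for m in (2, 25, 50):              # alpha = m/p
        vals = [forgetting(X, y, random_task_transform(p, m, rng))[1]
                for _ in range(200)]
        print(f"alpha={m / p:.2f}  mean forgetting={np.mean(vals):.4f}")
```

Averaging `forgetting` over many sampled transforms for a grid of `m` and `d` gives an empirical curve that can be compared against $\alpha = m/p$ and $\beta = 1 - d/p$; the sample sizes above are arbitrary choices for illustration.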

Paper Structure (54 sections, 20 theorems, 182 equations, 9 figures, 1 table, 1 algorithm)

Key Result

Theorem 3

Let $p\ge 4$, $d\in \left\{1,\dots,p\right\}$, $m\ge 2$. Define $\mathcal{X}_{p,d} \triangleq \left\{\, \mathbf{X}\in\mathbb{R}^{n\times p} \mid n\ge \operatorname{rank}(\mathbf{X})=d \,\right\}$. Define the Dimensionality of Transformed Subspace $\alpha\triangleq \frac{m}{p}$ as our proxy for task di... where $\mathbf{X}^{+}\mathbf{X}\mathbf{w}^{\star}$ projects $\mathbf{w}^{\star}$ onto the column sp...

Figures (9)

  • Figure 1: Informal illustration of our theoretical result. Formal details are shared in Section \ref{sec:analysis}.
  • Figure 2: Empirically illustrating the worst-case forgetting under different overparameterization levels. Points indicate the forgetting under 1000 sampled random transformations applied on a (single) random data matrix $\mathbf{X}$. Their mean is shown in the thin orange line, with the standard deviation represented by a gray band. The thick blue line depicts the analytical expression of Theorem \ref{thm:main}. Here, we restrict the nonzero singular values of $\mathbf{X}$ to be identical, saturating the inequality in Eq. (\ref{eq:forgetting-inequality}). Indeed, the analytical bound matches the empirical mean, thus exemplifying the tightness of our analysis. For completeness, in Appendix \ref{app:synthetic-figures}, we repeat this experiment with $p=10$ and $p=1000$.
  • Figure 3: Levelsets depicting our main result from Theorem \ref{thm:main}. The entire space (combinations of $\alpha,\beta$) appears on the lower-right subplot. We zoom into more interesting regimes, i.e., high task similarity and high overparameterization, on the lower-left and upper-right subplots (respectively).
  • Figure 4: Results of the numerical simulation. Risk and training error on task 1 are plotted as a function of $\alpha$ for various levels of $p$. The solid (blue) curves denote performance on task 1 of an estimator that is trained on task 1 and then on task 2. The dashed dark lines denote performance on task 1 of an estimator trained on task 1 only. The dotted black line denotes the performance of the null estimator. Training error curves for $w_1$ are omitted as these values are 0 for $p > d$.
  • Figure 5: Three versions of permuted MNIST for $PS=0$ (high similarity), $PS=14$ (intermediate similarity), $PS=28$ (low similarity).
  • ...and 4 more figures
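
The permutation size $PS$ in the Figure 5 caption can be illustrated with a short sketch. This is an assumption about the construction, not a detail confirmed by the text above: here $PS$ is taken to be the side length of a centered square window whose pixels are shuffled by one fixed random permutation, so that $PS=0$ leaves images unchanged and $PS=28$ shuffles the whole 28x28 image. The function name `permute_center_window` is hypothetical.

```python
import numpy as np


def permute_center_window(images, ps, rng):
    """Shuffle pixels inside a centered ps x ps window (assumed meaning of PS).

    images: array of shape (n, h, w); the same permutation is applied to
    every image, which is what defines a single permuted task.
    """
    n, h, w = images.shape
    out = images.copy()
    if ps == 0:
        return out  # no pixels permuted: task identical to the original
    top, left = (h - ps) // 2, (w - ps) // 2
    perm = rng.permutation(ps * ps)  # one fixed permutation per task
    window = out[:, top:top + ps, left:left + ps].reshape(n, ps * ps)
    out[:, top:top + ps, left:left + ps] = window[:, perm].reshape(n, ps, ps)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_images = rng.random((4, 28, 28))  # stand-in for MNIST digits
    for ps in (0, 14, 28):
        task2 = permute_center_window(fake_images, ps, rng)
        changed = np.mean(task2 != fake_images)
        print(f"PS={ps:2d}  fraction of pixels changed: {changed:.2f}")
```

Under this assumed construction, a larger window touches more input dimensions, which is the sense in which larger $PS$ corresponds to lower task similarity.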

Theorems & Definitions (42)

  • Definition 2: Forgetting
  • Theorem 3
  • proof: Proof for Theorem \ref{thm:main}
  • Remark 4: Ease of notation
  • Lemma 5
  • proof
  • Lemma 6
  • proof
  • Remark 7: Explaining proof techniques
  • Lemma 8
  • ...and 32 more