The Joint Effect of Task Similarity and Overparameterization on Catastrophic Forgetting -- An Analytical Model
Daniel Goldfarb, Itay Evron, Nir Weinberger, Daniel Soudry, Paul Hand
TL;DR
This work studies catastrophic forgetting in continual learning through a two-task linear regression model in which the second task is a random orthogonal transform of the first. The model is parameterized by the DOTS-derived similarity $\alpha = m/p$ and the overparameterization $\beta = 1 - d/p$. The authors derive an exact non-asymptotic expression for the expected forgetting under sequential SGD. The expression reveals a non-monotonic dependence on similarity in highly overparameterized regimes, and a monotonic decrease in forgetting with expected task similarity near the interpolation threshold. They provide a detailed proof sketch using Haar-orthogonal integrals, and validate the theory with synthetic linear regression and with neural-network experiments on permutation benchmarks, including permuted MNIST. The results show that overparameterization does not always prevent forgetting: when the model is highly overparameterized, forgetting peaks at intermediate similarity. The explicit $\alpha$ and $\beta$ dependencies give a principled framework for reasoning about continual-learning systems, and the work outlines paths for extending the analysis to nonlinear models and to more than two tasks.
Abstract
In continual learning, catastrophic forgetting is affected by multiple aspects of the tasks. Previous works have analyzed separately how forgetting is affected by either task similarity or overparameterization. In contrast, our paper examines how task similarity and overparameterization jointly affect forgetting in an analyzable model. Specifically, we focus on two-task continual linear regression, where the second task is a random orthogonal transformation of an arbitrary first task (an abstraction of random permutation tasks). We derive an exact analytical expression for the expected forgetting and uncover a nuanced pattern. In highly overparameterized models, intermediate task similarity causes the most forgetting. However, near the interpolation threshold, forgetting decreases monotonically with the expected task similarity. We validate our findings with linear regression on synthetic data, and with neural networks on established permutation task benchmarks.
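The two-task setup described above can be simulated in a few lines of NumPy. The following is a minimal sketch under stated assumptions, not the authors' code: the helper names (`random_orthogonal`, `subspace_transform`, `two_task_forgetting`) are hypothetical, the orthogonal operator is assumed to act on a random $m$-dimensional subspace and as the identity elsewhere, both tasks are assumed to share the same labels (as in permutation benchmarks), and training to convergence is modeled by minimum-norm updates, which is what gradient descent on a squared loss converges to from a given starting point.

```python
import numpy as np


def random_orthogonal(k, rng):
    """Sample a k x k orthogonal matrix from the Haar measure via QR."""
    q, r = np.linalg.qr(rng.standard_normal((k, k)))
    # Fix column signs so the distribution is Haar rather than QR-convention dependent.
    return q * np.sign(np.diag(r))


def subspace_transform(p, m, rng):
    """Assumed form of the task transform: a random rotation on a random
    m-dimensional subspace of R^p, identity on the orthogonal complement."""
    basis = random_orthogonal(p, rng)
    block = np.eye(p)
    if m > 0:
        block[:m, :m] = random_orthogonal(m, rng)
    return basis @ block @ basis.T


def two_task_forgetting(p, n, m, rng):
    """Train on task 1, then task 2, and return the task-1 mean squared error."""
    X1 = rng.standard_normal((n, p))
    y = X1 @ rng.standard_normal(p)       # realizable labels (assumption)
    X2 = X1 @ subspace_transform(p, m, rng)  # second task: transformed inputs, same labels

    # Task 1 from zero initialization: minimum-norm interpolating solution.
    w1 = np.linalg.pinv(X1) @ y
    # Task 2 starting from w1: closest point to w1 that fits task 2.
    w2 = w1 + np.linalg.pinv(X2) @ (y - X2 @ w1)
    # Forgetting: how badly the final weights fit the first task.
    return float(np.mean((X1 @ w2 - y) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, n = 100, 10  # illustrative sizes only
    for m in (0, 25, 50, 75, 100):
        runs = [two_task_forgetting(p, n, m, rng) for _ in range(50)]
        print(f"m/p = {m / p:.2f}  mean forgetting = {np.mean(runs):.4f}")
```

Averaging over many random draws, as in the loop at the bottom, gives an empirical estimate of the expected forgetting that can be compared against the analytical expression; varying `n` relative to `p` changes the degree of overparameterization.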
