Understanding Forgetting in Continual Learning with Linear Regression

Meng Ding, Kaiyi Ji, Di Wang, Jinhui Xu

TL;DR

This work analyzes forgetting in continual learning for sequential linear regression tasks under SGD, covering both underparameterized and overparameterized regimes. By deriving upper and nearly matching lower bounds that decompose error into variance and bias terms, the authors show that forgetting depends on the eigen-spectrum of task covariances, the step size, and the number of samples per task, and that training orders placing large-eigenvalue tasks later can increase forgetting when data size is large. The results introduce and rely on a fourth-moment covariate condition and define effective dimensions that capture projection effects across tasks, providing insights into when forgetting can vanish in overparameterized settings with appropriate spectral decay and step sizes. Empirical simulations on linear models and DNNs corroborate the theory, demonstrating the practical impact of eigenvalue sequencing and learning rate on forgetting, and offering guidance for designing robust continual-learning systems beyond Gaussian or minimum-norm assumptions.

Abstract

Continual learning, focused on sequentially learning multiple tasks, has gained significant attention recently. Despite substantial progress, the theoretical understanding of continual learning, especially of the factors contributing to catastrophic forgetting, remains relatively unexplored. In this paper, we provide a general theoretical analysis of forgetting in the linear regression model via Stochastic Gradient Descent (SGD), applicable to both underparameterized and overparameterized regimes. Our theoretical framework reveals insights into the relationship between task sequence and algorithmic parameters, an aspect not fully captured in previous studies due to their restrictive assumptions. Specifically, we demonstrate that, given a sufficiently large data size, a task sequence in which tasks with larger eigenvalues in their population data covariance matrices are trained later tends to result in increased forgetting. Additionally, our findings highlight that an appropriate choice of step size helps mitigate forgetting in both underparameterized and overparameterized settings. To validate our theoretical analysis, we conducted simulation experiments on both linear regression models and Deep Neural Networks (DNNs). Results from these simulations substantiate our theoretical findings.
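
The setting described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' experimental code: covariates are Gaussian with diagonal covariance, all tasks share one ground-truth vector (covariate shift only), SGD uses one fresh sample per step, and forgetting is taken to be the average increase in population excess risk on earlier tasks between the end of their own training and the end of the whole sequence. The function names (`run_sequence`, `excess_risk`, `forgetting`) and all numeric values are hypothetical.

```python
import numpy as np

def run_sequence(covs, w_star, n_steps, step_size, noise_std=0.1, seed=0):
    """Train one linear model with single-sample SGD on tasks in the given order.

    covs: list of diagonal covariance matrices, one per task (assumption).
    Returns a snapshot of the weights taken right after each task.
    """
    rng = np.random.default_rng(seed)
    d = w_star.shape[0]
    w = np.zeros(d)
    snapshots = []
    for H in covs:
        scales = np.sqrt(np.diag(H))  # valid because H is assumed diagonal
        for _ in range(n_steps):
            x = scales * rng.standard_normal(d)            # x ~ N(0, H)
            y = x @ w_star + noise_std * rng.standard_normal()
            w = w - step_size * (x @ w - y) * x            # SGD step on squared loss
        snapshots.append(w.copy())
    return snapshots

def excess_risk(w, w_star, H):
    """Population excess risk 0.5 * (w - w*)^T H (w - w*) for a task with covariance H."""
    diff = w - w_star
    return 0.5 * diff @ H @ diff

def forgetting(snapshots, covs, w_star):
    """Average over earlier tasks of risk(final weights) - risk(weights right after that task)."""
    if len(snapshots) < 2:
        raise ValueError("forgetting needs at least two tasks")
    w_final = snapshots[-1]
    gaps = [
        excess_risk(w_final, w_star, H) - excess_risk(w_m, w_star, H)
        for w_m, H in zip(snapshots[:-1], covs[:-1])
    ]
    return float(np.mean(gaps))

if __name__ == "__main__":
    spectrum = np.array([1.0, 0.5, 0.25, 0.125, 0.0625])
    small, large = np.diag(spectrum), np.diag(4.0 * spectrum)
    w_star = np.ones(5)
    for name, order in [("small -> large", [small, large]),
                        ("large -> small", [large, small])]:
        snaps = run_sequence(order, w_star, n_steps=500, step_size=0.04)
        print(name, forgetting(snaps, order, w_star))
```

Running the script prints the measured forgetting for both task orders, which allows an informal comparison with the ordering claim above; varying `n_steps` and `step_size` probes the dependence on data size and step size.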

Paper Structure

This paper contains 21 sections, 12 theorems, 91 equations, and 1 figure.

Key Result

Theorem 3.1

Consider a model $\mathbf{w}$ trained via SGD on $M$ distinct tasks in the order $1, \ldots, M$. With a constant step size $\eta \leq 1/R^2$, where $R^2 = \max \{\alpha_m \operatorname{tr}(\mathbf{H}_m)\}_{m=1}^{M}$, each task $m$ is executed for $N$ iterations ... where the variance and bias errors are upper-bounded by ... where the effective dimensions are given by ...
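
The step-size condition in the theorem, $\eta \leq 1/R^2$ with $R^2 = \max_m \alpha_m \operatorname{tr}(\mathbf{H}_m)$, can be computed directly once the task covariances are known. The sketch below is only an illustration: the helper name `max_step_size` is hypothetical, and the constants $\alpha_m$ are treated as user-supplied inputs because their definition is not reproduced in this summary (the value 3 used in the example is an assumption often adopted for Gaussian covariates).

```python
import numpy as np

def max_step_size(covs, alphas):
    """Largest constant step size allowed by eta <= 1 / R^2,
    where R^2 = max_m alpha_m * tr(H_m)."""
    r_squared = max(a * np.trace(H) for a, H in zip(alphas, covs))
    return 1.0 / r_squared

if __name__ == "__main__":
    # Two hypothetical tasks with diagonal covariances.
    covs = [np.diag([1.0, 2.0]), np.diag([3.0, 4.0])]
    alphas = [3.0, 3.0]  # assumed fourth-moment constants
    print(max_step_size(covs, alphas))
```

Because $R^2$ is a maximum over tasks, the task with the largest scaled trace determines the admissible step size for the entire sequence.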

Figures (1)

  • Figure 1: Impact of Task Sequence and Algorithmic Parameters on Forgetting Behavior with Linear Regression Model and Deep Neural Networks. This figure shows how task sequence order and algorithmic parameters (data size, dimensionality, and step size) affect forgetting in linear regression models (panels (a)-(f)) and deep neural networks (panels (g)-(l)). Panels (a), (b), (i), and (j) show how varying data size affects forgetting for different task sequences, while panels (c), (d), (k), and (l) show the effect of changing dimensionality. Panels (e)-(h) show the influence of step size on the rate of forgetting across different model configurations.

Theorems & Definitions (20)

  • Definition 2.1: Data Covariance
  • Definition 2.2: Covariate Shift
  • Remark 1
  • Theorem 3.1: Upper Bound
  • Theorem 3.2: Lower Bound
  • Lemma 1.1: zou2021benign
  • Lemma 1.2: Bias-variance decomposition
  • Lemma 2.2
  • Proof
  • Lemma 2.3
  • ...and 10 more