Memory-Statistics Tradeoff in Continual Learning with Structural Regularization

Haoran Li, Jingfeng Wu, Vladimir Braverman

TL;DR

This paper addresses forgetting in continual learning by studying a two-task linear regression problem under covariate shift and introducing generalized L2 structural regularization (GRCL) that stores a low-rank PSD matrix Sigma to constrain updates after the first task. It derives sharp risk bounds in both one-hot and Gaussian designs, revealing a fundamental memory-statistics trade-off: larger memory (more basis vectors in Sigma) improves statistical efficiency but increases memory cost, while smaller memory reduces memory use but harms excess risk. The results show OCL and L2-RCL suffer forgetting in low-memory regimes, whereas GRCL can match joint-training performance when enough memory is allocated, and reduces forgetting even in moderately large memory settings; this highlights curvature-aware regularization as a key tool for effective continual learning. The work also offers practical algorithmic guidance for constructing Sigma and validates the theory with numerical experiments, indicating broad implications for memory-constrained learning systems.

Abstract

We study the statistical performance of a continual learning problem with two linear regression tasks in a well-specified random design setting. We consider a structural regularization algorithm that incorporates a generalized $\ell_2$-regularization tailored to the Hessian of the previous task for mitigating catastrophic forgetting. We establish upper and lower bounds on the joint excess risk for this algorithm. Our analysis reveals a fundamental trade-off between memory complexity and statistical efficiency, where memory complexity is measured by the number of vectors needed to define the structural regularization. Specifically, increasing the number of vectors in structural regularization leads to a worse memory complexity but an improved excess risk, and vice versa. Furthermore, our theory suggests that naive continual learning without regularization suffers from catastrophic forgetting, while structural regularization mitigates this issue. Notably, structural regularization achieves comparable performance to joint training with access to both tasks simultaneously. These results highlight the critical role of curvature-aware regularization for continual learning.
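
The setup described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `simulate`, the diagonal covariances, the coefficient `reg`, and the choice of building the regularizer from the top-k eigenpairs of the first task's empirical Hessian are all hypothetical choices made for illustration. Ordinary continual learning (OCL) is modeled here as the limit of gradient descent on the second task initialized at the first-task solution, and the regularized second step is solved in closed form.

```python
import numpy as np


def simulate(k, n=200, d=20, reg=1.0, noise=1.0, seed=0):
    """Two-task linear regression with a rank-k structural regularizer.

    All names and constants are illustrative assumptions, not taken from the paper.
    Returns the joint excess risk (sum over both tasks) of three estimators.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical covariate shift: each task has large variance exactly
    # where the other task has small variance (diagonal covariances).
    h1 = np.where(np.arange(d) < d // 2, 1.0, 1e-4)
    h2 = h1[::-1].copy()
    w_star = np.ones(d) / np.sqrt(d)  # shared true parameter (well-specified)

    def sample(h):
        X = rng.standard_normal((n, d)) * np.sqrt(h)
        y = X @ w_star + noise * rng.standard_normal(n)
        return X, y

    X1, y1 = sample(h1)
    X2, y2 = sample(h2)

    def joint_excess_risk(w):
        diff = w - w_star
        return float(diff @ ((h1 + h2) * diff))

    # Task 1: ordinary least squares.
    w1 = np.linalg.lstsq(X1, y1, rcond=None)[0]

    # OCL: fit task 2 starting from w1 with no regularization
    # (limit of gradient descent initialized at w1).
    w_ocl = w1 + np.linalg.pinv(X2) @ (y2 - X2 @ w1)

    # Structural regularizer: rank-k PSD matrix from the top-k eigenpairs of
    # the task-1 empirical Hessian. Storing it takes k vectors of dimension d.
    evals, evecs = np.linalg.eigh(X1.T @ X1 / n)  # eigenvalues in ascending order
    U, lam = evecs[:, d - k:], evals[d - k:]
    Sigma = reg * (U * lam) @ U.T

    # Minimize (1/n)||X2 w - y2||^2 + (w - w1)^T Sigma (w - w1) in closed form.
    A = X2.T @ X2 / n + Sigma
    b = X2.T @ y2 / n + Sigma @ w1
    w_grcl = np.linalg.solve(A, b)

    # Joint learning: least squares on the pooled data of both tasks.
    w_joint = np.linalg.lstsq(
        np.vstack([X1, X2]), np.concatenate([y1, y2]), rcond=None
    )[0]

    return {
        "ocl": joint_excess_risk(w_ocl),
        "grcl": joint_excess_risk(w_grcl),
        "joint": joint_excess_risk(w_joint),
    }


if __name__ == "__main__":
    for k in (0, 5, 10, 20):
        print(k, simulate(k))
```

With `k=0` the regularizer vanishes and the regularized estimator coincides with OCL in this sketch; increasing `k` stores more vectors and protects more directions of the first task, which is the memory-statistics trade-off the abstract describes.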

Paper Structure

This paper contains 48 sections, 28 theorems, 168 equations, 2 figures.

Key Result

Proposition 1

Suppose Assumptions assump:noise, assump:commutable, and assump:one-hot hold. Define the index sets $\mathbb{J} = \{ i: \mu_i > \frac{1}{n} \}$ and $\mathbb{K} = \{ i: \lambda_i > \frac{1}{n} \}$. Then for $\bm{w}_\mathtt{joint}$ given by eq:joint-learning, where

Figures (2)

  • Figure 1: Expected excess risk vs. \ref{fig:change-n} sample size $n$, and \ref{fig:change-k} memory size $k$ for generalized $\ell_2$-regularized continual learning (GRCL), compared with joint learning (JL) and ordinary continual learning (OCL). For each point on each curve, the y-axis shows the expected CL excess risk on a logarithmic scale. The problem instance $\mathtt{P}(15)$ is defined by \ref{eqn:setting-p}. The dimension of the task is $d=200$. The sample size is fixed at $n=5000$ in \ref{fig:change-k}. The excess risk is computed as an empirical average over $20$ independent runs.
  • Figure 2: Average accuracy across previously learned tasks on \ref{fig:perm-mnist} Permuted MNIST and \ref{fig:rotated-mnist} Rotated MNIST after each epoch of training for the vanilla algorithm without regularization, the regularization-based method with the full Hessian, and the method with low-rank regularization. In both experiments, we use the Adam optimizer with a learning rate of $10^{-4}$. The moving average parameter is $\alpha=0.25$ and the regularization coefficient is $10^{4}$ for all algorithms.

Theorems & Definitions (65)

  • Definition 1: Covariance conditions
  • Definition 2: Model conditions
  • Remark 1: OCL in the interpolation regime
  • Proposition 1
  • Theorem 2
  • Corollary 3
  • Corollary 4
  • Example 5
  • Corollary 6
  • Example 7
  • ...and 55 more