Memory-Statistics Tradeoff in Continual Learning with Structural Regularization
Haoran Li, Jingfeng Wu, Vladimir Braverman
TL;DR
This paper addresses forgetting in continual learning by studying a two-task linear regression problem under covariate shift. It introduces a generalized L2 structural regularization (GRCL) that stores a low-rank PSD matrix Sigma to constrain updates after the first task. It derives sharp risk bounds in both one-hot and Gaussian designs, revealing a fundamental memory-statistics trade-off: storing more basis vectors in Sigma lowers the excess risk but raises the memory cost, while storing fewer saves memory at the price of a higher excess risk. The results show that OCL and L2-RCL suffer from forgetting in low-memory regimes, whereas GRCL can match joint-training performance when enough memory is allocated and reduces forgetting even in moderately large memory settings. This highlights curvature-aware regularization as a key tool for effective continual learning. The work also offers practical algorithmic guidance for constructing Sigma and validates the theory with numerical experiments.
Abstract
We study the statistical performance of a continual learning problem with two linear regression tasks in a well-specified random design setting. We consider a structural regularization algorithm that incorporates a generalized $\ell_2$-regularization tailored to the Hessian of the previous task for mitigating catastrophic forgetting. We establish upper and lower bounds on the joint excess risk for this algorithm. Our analysis reveals a fundamental trade-off between memory complexity and statistical efficiency, where memory complexity is measured by the number of vectors needed to define the structural regularization. Specifically, increasing the number of vectors in structural regularization leads to a worse memory complexity but an improved excess risk, and vice versa. Furthermore, our theory suggests that naive continual learning without regularization suffers from catastrophic forgetting, while structural regularization mitigates this issue. Notably, structural regularization achieves comparable performance to joint training with access to both tasks simultaneously. These results highlight the critical role of curvature-aware regularization for continual learning.
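The structural regularization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the regularizer is the quadratic form $(w - w_1)^\top \Sigma (w - w_1)$, that $\Sigma$ is built from the top-$k$ eigen-directions of the empirical task-1 Hessian, and that the regularizer is left unscaled. The function names (`top_k_psd`, `regularized_fit`, `joint_excess_risk`) and the toy covariances are hypothetical choices made for illustration.

```python
import numpy as np

def top_k_psd(H, k):
    """Keep the k leading eigen-directions of a symmetric PSD matrix H.

    Storing the result needs only k vectors (plus k eigenvalues), which is
    the memory cost in this sketch.
    """
    eigvals, eigvecs = np.linalg.eigh(H)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    V = eigvecs[:, order]
    return (V * eigvals[order]) @ V.T

def regularized_fit(X, y, w_prev, Sigma):
    """Minimize (1/n)||Xw - y||^2 + (w - w_prev)^T Sigma (w - w_prev).

    Setting the gradient to zero gives the linear system
    (X^T X / n + Sigma) w = X^T y / n + Sigma w_prev.
    """
    n = X.shape[0]
    A = X.T @ X / n + Sigma
    b = X.T @ y / n + Sigma @ w_prev
    return np.linalg.lstsq(A, b, rcond=None)[0]   # min-norm solution if A is singular

def joint_excess_risk(w, w_star, H1, H2):
    """Sum of the two tasks' excess risks under population covariances H1, H2."""
    diff = w - w_star
    return float(diff @ (H1 + H2) @ diff)

# Toy well-specified setup (assumed): both tasks share w_star, covariances differ.
rng = np.random.default_rng(0)
d, n, noise = 20, 200, 0.5
w_star = rng.normal(size=d) / np.sqrt(d)
H1 = np.diag(np.r_[np.ones(5), 0.01 * np.ones(15)])   # task 1 emphasizes the first 5 coordinates
H2 = np.diag(np.r_[0.01 * np.ones(5), np.ones(15)])   # task 2 emphasizes the remaining 15
X1 = rng.normal(size=(n, d)) @ np.sqrt(H1)            # elementwise sqrt is valid for diagonal H
X2 = rng.normal(size=(n, d)) @ np.sqrt(H2)
y1 = X1 @ w_star + noise * rng.normal(size=n)
y2 = X2 @ w_star + noise * rng.normal(size=n)

# Task 1: plain least squares (zero regularizer).
w1 = regularized_fit(X1, y1, np.zeros(d), np.zeros((d, d)))
H1_hat = X1.T @ X1 / n                                # empirical task-1 Hessian

# Task 2: vary the memory budget k; k = 0 is continual learning without regularization.
for k in (0, 2, 5, d):
    Sigma = top_k_psd(H1_hat, k)
    w2 = regularized_fit(X2, y2, w1, Sigma)
    print(f"k={k:2d}  joint excess risk={joint_excess_risk(w2, w_star, H1, H2):.4f}")
```

The loop varies only the number of stored vectors `k`, so the printed risks give a rough feel for how memory and statistical performance interact in this toy setting. The numbers depend on the assumed covariances, noise level, and seed, and they say nothing about the bounds proved in the paper.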
