A Statistical Theory of Regularization-Based Continual Learning
Xuyang Zhao, Huiyuan Wang, Weiran Huang, Wei Lin
TL;DR
This work analyzes regularization-based continual learning on a sequence of linear regression tasks under memory constraints. It introduces a generalized $\ell_2$-regularization (GR) framework with matrix-valued hyperparameters and shows that, with suitably chosen hyperparameters, the estimation error matches the oracle rate, mitigating catastrophic forgetting across tasks with heterogeneous data. The authors establish explicit per-coordinate error dynamics, reveal an intrinsic link between GR and early stopping, and provide a practical algorithm that adapts the regularization to task information through covariances. The results also connect to existing methods (e.g., EWC) and show that appropriate hyperparameter choices balance forward and backward knowledge transfer; experiments support the theory. Extensions to other loss functions and relaxations of the assumptions are also discussed.
Abstract
We provide a statistical analysis of regularization-based continual learning on a sequence of linear regression tasks, with emphasis on how different regularization terms affect model performance. We first derive the convergence rate for the oracle estimator obtained as if all data were available simultaneously. Next, we consider a family of generalized $\ell_2$-regularization algorithms indexed by matrix-valued hyperparameters, which includes the minimum norm estimator and continual ridge regression as special cases. As more tasks are introduced, we derive an iterative update formula for the estimation error of generalized $\ell_2$-regularized estimators, from which we determine the hyperparameters that yield the optimal algorithm. Interestingly, the choice of hyperparameters can effectively balance the trade-off between forward and backward knowledge transfer and adjust for data heterogeneity. Moreover, the estimation error of the optimal algorithm is derived explicitly and is of the same order as that of the oracle estimator. In contrast, our lower bounds for the minimum norm estimator and continual ridge regression show that both are suboptimal. A byproduct of our theoretical analysis is the equivalence between early stopping and generalized $\ell_2$-regularization in continual learning, which may be of independent interest. Finally, we conduct experiments to complement our theory.
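To make the generalized $\ell_2$-regularization family concrete, the sketch below assumes that each step minimizes $\tfrac12\|y_t - X_t w\|^2 + \tfrac12 (w - w_{t-1})^\top \Lambda_t (w - w_{t-1})$, where $\Lambda_t$ is the matrix-valued hyperparameter. This form, the function names (`gr_step`, `gd_steps`, `early_stopping_matrix`), and the simulated data are illustrative assumptions, not taken from the paper. Under this form, $\Lambda_t = \lambda I$ gives a continual ridge step, and $\Lambda_t = \lambda I$ with $\lambda \to 0$ gives the minimum-norm update in the limit. The demo uses one simple covariance-based choice, the accumulated Gram matrix of past tasks, which makes the continual estimate coincide with ridge regression on the pooled data while storing only a $d \times d$ matrix; it is not claimed to be the paper's optimal hyperparameter. The last part checks, for a single task, that a fixed number of gradient-descent steps equals one GR step for a particular $\Lambda$.

```python
import numpy as np

def gr_step(w_prev, X, y, Lam):
    """One generalized l2-regularized update (assumed form):
    argmin_w 0.5*||y - X w||^2 + 0.5*(w - w_prev)^T Lam (w - w_prev)."""
    return w_prev + np.linalg.solve(X.T @ X + Lam, X.T @ (y - X @ w_prev))

def gd_steps(w_prev, X, y, eta, k):
    """k gradient-descent steps on 0.5*||y - X w||^2, started at w_prev."""
    w = w_prev.copy()
    for _ in range(k):
        w = w + eta * X.T @ (y - X @ w)
    return w

def early_stopping_matrix(X, eta, k):
    """Lam for which gr_step reproduces gd_steps.
    Assumes X^T X is positive definite and eta * (largest eigenvalue) < 1."""
    g, V = np.linalg.eigh(X.T @ X)
    s = (1.0 - eta * g) ** k             # shrinkage of each eigen-direction after k steps
    return (V * (g * s / (1.0 - s))) @ V.T   # V diag(g*s/(1-s)) V^T

# Simulated tasks sharing one parameter vector, with heterogeneous designs.
rng = np.random.default_rng(0)
d, eps = 5, 1e-3
w_true = rng.normal(size=d)
tasks = []
for n in (20, 30, 40):
    X = rng.normal(size=(n, d)) * rng.uniform(0.5, 2.0, size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)
    tasks.append((X, y))

# Continual pass: only w and a d x d matrix are kept between tasks.
w, Lam = np.zeros(d), eps * np.eye(d)
for X, y in tasks:
    w = gr_step(w, X, y, Lam)
    Lam = Lam + X.T @ X                  # accumulate the Gram matrix of past tasks

# Reference: ridge regression on all data at once, same eps.
X_all = np.vstack([X for X, _ in tasks])
y_all = np.concatenate([y for _, y in tasks])
w_pooled = np.linalg.solve(X_all.T @ X_all + eps * np.eye(d), X_all.T @ y_all)
print(np.allclose(w, w_pooled))

# Early stopping versus GR on the first task, from an arbitrary start.
X, y = tasks[0]
eta = 0.5 / np.linalg.eigvalsh(X.T @ X).max()
k = 25
w0 = rng.normal(size=d)
w_gd = gd_steps(w0, X, y, eta, k)
w_gr = gr_step(w0, X, y, early_stopping_matrix(X, eta, k))
print(np.allclose(w_gd, w_gr))
```

The pooled-data match follows because, with the accumulated Gram matrix as $\Lambda_t$, the quadratic penalty around $w_{t-1}$ equals the past tasks' ridge objective up to a constant. The early-stopping match follows because both procedures shrink $w_0 - w^\ast$ along each eigen-direction of $X^\top X$, and the GR factor $\lambda_i / (g_i + \lambda_i)$ equals $(1 - \eta g_i)^k$ when $\lambda_i = g_i (1 - \eta g_i)^k / \bigl(1 - (1 - \eta g_i)^k\bigr)$.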
