Revisiting Weight Regularization for Low-Rank Continual Learning

Yaoyue Zheng; Yin Zhang; Joost van de Weijer; Gido M van de Ven; Shaoyi Du; Xuetao Zhang; Zhiqiang Tian

TL;DR

This paper revisits weight regularization in low-rank CL as a new perspective for mitigating task interference in PECL and proposes EWC-LoRA, a practical, computation- and memory-efficient solution for CL with PTMs that achieves a better stability-plasticity trade-off than existing low-rank CL approaches.

Abstract

Continual Learning (CL) with large-scale pre-trained models (PTMs) has recently gained wide attention, shifting the focus from training from scratch to continually adapting PTMs. This has given rise to a promising paradigm: parameter-efficient continual learning (PECL), where task interference is typically mitigated by assigning a task-specific module during training, such as low-rank adapters. However, weight regularization techniques such as Elastic Weight Consolidation (EWC), a key strategy in CL, remain underexplored in this new paradigm. In this paper, we revisit weight regularization in low-rank CL as a new perspective for mitigating task interference in PECL. Unlike existing low-rank CL methods, we mitigate task interference by regularizing a shared low-rank update through EWC, thereby keeping the storage requirement and inference costs constant regardless of the number of tasks. Our proposed method, EWC-LoRA, leverages a low-rank representation to estimate parameter importance over the full-dimensional space. This design offers a practical, computation- and memory-efficient solution for CL with PTMs, and provides insights that may inform the broader application of regularization techniques within PECL. Extensive experiments on various benchmarks demonstrate the effectiveness of EWC-LoRA, which achieves a stability-plasticity trade-off superior to existing low-rank CL approaches. These results indicate that, even under low-rank parameterizations, weight regularization remains an effective mechanism for mitigating task interference. Code is available at: https://github.com/yaoyz96/low-rank-cl.
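
The idea of regularizing a shared low-rank update with EWC can be illustrated with a minimal sketch. It assumes a diagonal Fisher estimate stored at the shape of the full weight matrix and a quadratic penalty on how much the product AB has changed since the previous task. The function names, the toy values, and the exact penalty form are illustrative assumptions, not the authors' implementation (the linked repository is the reference for that).

```python
# Illustrative sketch only: the penalty form and all names are assumptions.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def ewc_lowrank_penalty(A, B, A_prev, B_prev, fisher, lam):
    """Diagonal-Fisher EWC penalty on a shared low-rank update dW = A @ B.

    `fisher` has the shape of the full weight matrix (d_out x d_in), so the
    importance weighting acts in the full-dimensional space even though only
    the factors A (d_out x r) and B (r x d_in) are trainable.
    """
    dW = matmul(A, B)                 # current low-rank update
    dW_prev = matmul(A_prev, B_prev)  # update stored after the previous task
    total = 0.0
    for f_row, new_row, old_row in zip(fisher, dW, dW_prev):
        for f, new, old in zip(f_row, new_row, old_row):
            total += f * (new - old) ** 2
    return 0.5 * lam * total


# Toy example: d_out = 2, d_in = 3, rank r = 1.
A_prev = [[1.0], [0.0]]
B_prev = [[1.0, 0.0, 2.0]]
A = [[1.0], [1.0]]          # only the second output row has changed
B = [[1.0, 0.0, 2.0]]
fisher = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]

penalty = ewc_lowrank_penalty(A, B, A_prev, B_prev, fisher, lam=2.0)
unchanged = ewc_lowrank_penalty(A_prev, B_prev, A_prev, B_prev, fisher, lam=2.0)
print(penalty, unchanged)  # 2.5 0.0
```

Because a single pair of factors is shared across tasks in this sketch, the stored state (A, B, and one Fisher matrix) does not grow with the number of tasks, which matches the constant-storage property claimed in the abstract.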

Paper Structure (54 sections, 3 theorems, 33 equations, 13 figures, 15 tables, 1 algorithm)

This paper contains 54 sections, 3 theorems, 33 equations, 13 figures, 15 tables, 1 algorithm.

Introduction
Related Works
Continual Learning (CL).
Parameter-Efficient Continual Learning (PECL).
Elastic Weight Consolidation (EWC).
Methodology
Preliminaries
Notations.
Problem Formulation.
Elastic Weight Consolidation.
Low-rank Adaptation (LoRA).
EWC with Low-rank Adaptation
Overview of EWC-LoRA
Experiments
Benchmarks
...and 39 more sections

Key Result

Proposition 1

Let $\Delta \mathbf{W} \in \mathbb{R}^{d_{O} \times d_{I}}$ be a model parameter matrix factorized as $\Delta \mathbf{W} = \mathbf{A}\mathbf{B}$, with $\mathbf{A} \in \mathbb{R}^{d_{O} \times r}$ and $\mathbf{B} \in \mathbb{R}^{r \times d_{I}}$. Define the EWC regularization term in the full space as ...

Figures (13)

Figure 1: Overview of learning task $\mathcal{T}_t$ at a specific layer of the ViT model. (a) Prior low-rank CL methods structurally isolate task-specific LoRA parameters by adding a new LoRA branch for each task. (b) The proposed EWC-LoRA employs a shared LoRA module that is learned across all tasks and regularized according to parameter importance measured by a Fisher Information Matrix, which is updated after learning each task.
Figure 2: Task-wise performance comparison of different methods across various datasets.
Figure 3: (a) Stability–Plasticity curves illustrating the trade-off between retaining previous knowledge and learning new tasks. (b) Performance across a range of regularization strengths $\lambda$ on CIFAR-100 and DomainNet, showing the effect of $\lambda$ on accuracy.
Figure 4: Stability-Plasticity trade-off for various values of the task decay factor $\gamma$. A broad range of $\gamma$ (0.3–0.9) yields a similar trade-off, indicating that the method is not sensitive to the precise choice of $\gamma$.
Figure 5: Task-wise performance on CIFAR-100 under different $\gamma$ settings.
...and 8 more figures
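
The captions above mention a Fisher Information Matrix that is updated after each task (Figure 1) and a task decay factor $\gamma$ (Figure 4). One common way to combine the two, used in online variants of EWC, is a decayed running sum. The sketch below assumes that form and a diagonal Fisher estimate; the function name `update_fisher` is hypothetical, and the paper's actual update rule may differ.

```python
# Illustrative sketch only: a decayed running sum is an assumed update rule.

def update_fisher(fisher_old, fisher_new, gamma):
    """Return gamma * fisher_old + fisher_new, element by element.

    gamma < 1 down-weights importance accumulated on earlier tasks, and
    gamma = 1 reduces to a plain sum over tasks.
    """
    return [
        [gamma * old + new for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(fisher_old, fisher_new)
    ]


# Toy example with a 2 x 2 diagonal Fisher estimate.
fisher_running = [[1.0, 2.0], [0.0, 4.0]]   # accumulated over earlier tasks
fisher_task = [[0.5, 0.5], [1.0, 0.0]]      # estimated on the task just learned
fisher_running = update_fisher(fisher_running, fisher_task, gamma=0.5)
print(fisher_running)  # [[1.0, 1.5], [1.0, 2.0]]
```

Under this assumed rule only one Fisher matrix is kept, whatever the number of tasks.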

Theorems & Definitions (3)

Proposition 1
Proposition 2
Proposition 3
