Elastic Weight Consolidation Done Right for Continual Learning

Xuan Liu, Xiaobin Chang

Abstract

Weight regularization methods in continual learning (CL) alleviate catastrophic forgetting by assessing and penalizing changes to important model weights. Elastic Weight Consolidation (EWC) is a foundational and widely used approach within this framework that estimates weight importance based on gradients. However, it has consistently shown suboptimal performance. In this paper, we conduct a systematic analysis of importance estimation in EWC from a gradient-based perspective. For the first time, we find that EWC's reliance on the Fisher Information Matrix (FIM) results in gradient vanishing and inaccurate importance estimation in certain scenarios. Our analysis also reveals that Memory Aware Synapses (MAS), a variant of EWC, imposes unnecessary constraints on parameters irrelevant to prior tasks, which we term redundant protection. Consequently, both EWC and its variants exhibit fundamental misalignments in estimating weight importance, leading to inferior performance. To tackle these issues, we propose the Logits Reversal (LR) operation, a simple yet effective modification that rectifies EWC's importance estimation. Specifically, reversing the logit values during the calculation of the FIM can effectively prevent both gradient vanishing and redundant protection. Extensive experiments across various CL tasks and datasets show that the proposed method significantly outperforms EWC and its existing variants. Therefore, we refer to it as EWC Done Right (EWC-DR).
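As a reading aid for the gradient-vanishing claim, the identity below is the standard gradient of softmax cross-entropy with respect to the logits. It is not the paper's own derivation: the symbols $z$ (logits), $p$ (softmax probabilities), $y$ (one-hot label), and the class index $k$ are assumptions introduced here, and the sketch assumes importance is built from squared cross-entropy gradients evaluated at the ground-truth label.

```latex
\frac{\partial \mathcal{L}_{CE}}{\partial z_k} = p_k - y_k,
\qquad
p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}
\quad\Longrightarrow\quad
\left( \frac{\partial \mathcal{L}_{CE}}{\partial z_k} \right)^{2} \to 0
\ \text{ as } \ p \to y .
```

Under these assumptions, every weight gradient passes through $p - y$ by the chain rule, so a model that fits a sample confidently yields a near-zero squared gradient for that sample, and weights that matter for it can receive a low importance score.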

Paper Structure (29 sections, 26 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Overview of the EWC learning process. After training on task $t-1$, model weights $\theta^{t-1}$ are obtained. The gradient of the cross-entropy loss $\mathcal{L}_{CE}$ is computed using the dataset $\mathcal{D}^{t-1}$ with the model $f(\cdot,\theta^{t-1})$. These gradients are used to estimate the weight importance matrix $\Omega^{t-1}$, not to update the weights. When learning a new task $t$, a regularization term $\mathcal{L}_{reg}$ based on $\Omega^{t-1}$ is added to the total loss to constrain changes and preserve important parameters in $\theta^{t-1}$.
  • Figure 2: A Case Study of Gradient Vanishing.
  • Figure 3: A Case Study of Redundant Protection.
  • Figure 4: CD diagram comparing the $A_{avg}$ of different methods across all evaluated EFCIL settings. The CD is 1.438 with a significance level of 0.05.
  • Figure 5: Importance matrices of the FC layer weights (ground truth at class 2) for EWCs.
  • ...and 2 more figures
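
The process described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a single linear softmax layer with weights `W`, a diagonal importance matrix built from squared per-sample cross-entropy gradients, and the usual quadratic penalty. The function names, the strength `lam`, and the toy data are hypothetical.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def importance_from_ce_gradients(W, X, y):
    """Diagonal importance for a linear softmax layer (assumed model).

    W: (C, D) weights after the previous task, X: (n, D) inputs,
    y: (n,) integer labels. Returns the mean of the squared per-sample
    gradients of the cross-entropy loss with respect to W.
    """
    n, num_classes = X.shape[0], W.shape[0]
    probs = softmax(X @ W.T)                 # (n, C)
    onehot = np.eye(num_classes)[y]          # (n, C)
    err = probs - onehot                     # dL/dlogits for each sample
    # The per-sample gradient w.r.t. W is outer(err_i, x_i); its elementwise
    # square is outer(err_i**2, x_i**2), averaged here over the n samples.
    return np.einsum("nc,nd->cd", err ** 2, X ** 2) / n


def ewc_penalty(W, W_old, omega, lam=1.0):
    """Quadratic penalty on weight changes, scaled by importance."""
    return 0.5 * lam * np.sum(omega * (W - W_old) ** 2)


# Toy usage: three classes, three one-hot inputs (hypothetical data).
X_prev = np.eye(3)
y_prev = np.array([0, 1, 2])
W_prev = np.eye(3)

omega_prev = importance_from_ce_gradients(W_prev, X_prev, y_prev)
W_new = W_prev + 0.1                         # stand-in for a new-task update
reg = ewc_penalty(W_new, W_prev, omega_prev, lam=10.0)
```

The gradients are only squared and averaged into `omega_prev`; they never update `W_prev`. When training on the new task, `reg` would be added to that task's cross-entropy loss. Scaling `W_prev` up so the layer becomes more confident on `X_prev` shrinks `omega_prev`, which is one way to observe the gradient-vanishing effect in this toy setting.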