BECAME: BayEsian Continual Learning with Adaptive Model MErging

Mei Li, Yuxiang Lu, Qinyan Dai, Suizhi Huang, Yue Ding, Hongtao Lu

TL;DR

The paper tackles catastrophic forgetting in continual learning by marrying gradient projection with adaptive model merging under a Bayesian lens. It proves there exists a merging point along the line between old and new task parameters that can reduce cumulative loss and derives a closed-form optimal merging coefficient via a Laplace (MAP) approximation, computable with Fisher information. The proposed two-stage BECAME framework first stabilizes learning with gradient projection and then enhances plasticity through unconstrained retraining before merging; this yields state-of-the-art performance on several CL benchmarks with improved plasticity and stable retention. Overall, the work provides a principled, generalizable mechanism to balance stability and plasticity in continual learning and offers practical guidance for integrating adaptive merging into existing CL methods.
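To make the closed-form claim concrete, here is a minimal worked sketch rather than the paper's exact derivation. It assumes second-order (Laplace-style) expansions of the cumulative old-task loss around $\theta_{t-1}^*$ and of the new-task loss around $\hat{\theta}_t$, with Fisher information matrices standing in for the Hessians. The symbols $F_{1:t-1}$, $F_t$, $\Delta\theta$, and the convention that $\lambda$ weights the new parameters are assumptions introduced here for illustration.

$$
\theta(\lambda) = (1-\lambda)\,\theta_{t-1}^* + \lambda\,\hat{\theta}_t = \theta_{t-1}^* + \lambda\,\Delta\theta,
\qquad \Delta\theta = \hat{\theta}_t - \theta_{t-1}^*
$$

$$
\mathcal{L}_{1:t-1}(\theta(\lambda)) \approx \mathcal{L}_{1:t-1}(\theta_{t-1}^*) + \tfrac{1}{2}\,\lambda^2\,\Delta\theta^\top F_{1:t-1}\,\Delta\theta,
\qquad
\mathcal{L}_{t}(\theta(\lambda)) \approx \mathcal{L}_{t}(\hat{\theta}_t) + \tfrac{1}{2}\,(1-\lambda)^2\,\Delta\theta^\top F_{t}\,\Delta\theta
$$

$$
\frac{d}{d\lambda}\Big[\mathcal{L}_{1:t-1} + \mathcal{L}_{t}\Big] = 0
\;\;\Longrightarrow\;\;
\lambda^* = \frac{\Delta\theta^\top F_{t}\,\Delta\theta}{\Delta\theta^\top \left(F_{1:t-1} + F_{t}\right)\Delta\theta}
$$

The first-order terms vanish because each expansion point is a (local) minimum of its own loss. With positive semi-definite Fisher matrices and a nonzero denominator, $\lambda^*$ falls in $[0,1]$: higher curvature of the old-task loss along $\Delta\theta$ pulls the merged point toward $\theta_{t-1}^*$, and higher curvature of the new-task loss pulls it toward $\hat{\theta}_t$.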

Abstract

Continual Learning (CL) strives to learn incrementally across tasks while mitigating catastrophic forgetting. A key challenge in CL is balancing stability (retaining prior knowledge) and plasticity (learning new tasks). While representative gradient projection methods ensure stability, they often limit plasticity. Model merging techniques offer promising solutions, but prior methods typically rely on empirical assumptions and carefully selected hyperparameters. In this paper, we explore the potential of model merging to enhance the stability-plasticity trade-off, providing theoretical insights that underscore its benefits. Specifically, we reformulate the merging mechanism using Bayesian continual learning principles and derive a closed-form solution for the optimal merging coefficient that adapts to the diverse characteristics of tasks. To validate our approach, we introduce a two-stage framework named BECAME, which synergizes the expertise of gradient projection and adaptive merging. Extensive experiments show that our approach outperforms state-of-the-art CL methods and existing merging strategies.
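As a rough illustration of what an adaptive merging coefficient can look like in practice, the sketch below computes a Fisher-weighted coefficient and merges two parameter vectors. It is not the authors' implementation: it assumes diagonal Fisher estimates, a quadratic surrogate for each loss around its own minimum, and that the coefficient `lam` weights the new parameters. The function names (`merge_coefficient`, `merge`, `surrogate_loss`) and the toy values are hypothetical.

```python
import numpy as np


def merge_coefficient(theta_old, theta_new, fisher_old, fisher_new, eps=1e-12):
    """Weight on theta_new that minimizes the summed quadratic surrogates.

    Assumption: fisher_old / fisher_new are diagonal Fisher estimates
    (one non-negative value per parameter), not full matrices.
    """
    delta = theta_new - theta_old
    # Curvature of each surrogate loss along the merging direction.
    curv_old = float(np.sum(fisher_old * delta ** 2))
    curv_new = float(np.sum(fisher_new * delta ** 2))
    return curv_new / (curv_old + curv_new + eps)


def merge(theta_old, theta_new, lam):
    """Linear interpolation between old and new parameters."""
    return (1.0 - lam) * theta_old + lam * theta_new


def surrogate_loss(theta, theta_old, theta_new, fisher_old, fisher_new):
    """Sum of the two quadratic (Laplace-style) surrogate losses."""
    old_term = 0.5 * np.sum(fisher_old * (theta - theta_old) ** 2)
    new_term = 0.5 * np.sum(fisher_new * (theta - theta_new) ** 2)
    return float(old_term + new_term)


# Toy values, chosen only for illustration.
theta_old = np.array([0.0, 0.0])
theta_new = np.array([1.0, 2.0])
fisher_old = np.array([3.0, 0.5])
fisher_new = np.array([1.0, 1.0])

lam = merge_coefficient(theta_old, theta_new, fisher_old, fisher_new)
theta_merged = merge(theta_old, theta_new, lam)
```

In this toy setting the two curvatures along the merging direction happen to be equal, so the merged point sits halfway between the two parameter vectors; unequal Fisher values would shift it toward whichever task is more sensitive along that direction.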

Paper Structure

This paper contains 25 sections, 1 theorem, 28 equations, 13 figures, 13 tables, 1 algorithm.

Key Result

Lemma 3.1

Given model parameters $\theta_{t-1}^*$, which minimize loss $\mathcal{L}_{1:t-1}$, and new parameters $\hat{\theta}_t$ optimized from $\theta_{t-1}^*$ to local minima of $\mathcal{L}_t$, there exists a coefficient $\lambda \in [0,1]$ satisfying that

Figures (13)

  • Figure 1: (Left) The training loss landscape for task 1 and task 2, represented by red and blue contours, respectively. Darker color denotes lower loss. (Right) The training loss values for task 1, task 2, and their sum. Both figures are based on the NSCL-based experiment on the 10-split CIFAR-100 dataset. The model initially learns task 1, reaching $\theta_1^*$, which minimizes the loss $\mathcal{L}_1$. When learning task 2, the model first obtains $\theta_2^{\text{GP}}$ using gradient projection, with minor forgetting indicated by the increase in $\mathcal{L}_1$. This solution shows limited plasticity, as $\mathcal{L}_2$ remains high. The model then proceeds to train without constraints, reaching $\hat{\theta}_2$, the minimum of $\mathcal{L}_2$. By analyzing the trajectory from $\theta_2^{\text{GP}}$ to $\hat{\theta}_2$, our method can determine the optimal merging coefficient, achieving the minimal cumulative loss at $\theta_2^*$.
  • Figure 2: Visualization of the training loss and test accuracy landscapes for task 1, task 2, cumulative training loss, and average test accuracy. The figures are derived from the NSCL-based experiment conducted on the 10-split CIFAR-100 dataset. The model is initially trained on task 1, starting from a random initialization to obtain $\theta_1^*$. Next, the model undergoes training on task 2 with gradient projection, yielding $\theta_2^{\text{GP}}$. Further training on task 2 is then performed without constraints, resulting in $\hat{\theta}_2$. Our adaptive method then merges $\theta_2^{\text{GP}}$ and $\hat{\theta}_2$ to locate the optimal point that minimizes training loss along the trajectory from $\theta_2^{\text{GP}}$ to $\hat{\theta}_2$.
  • Figure 3: BWT and IM in GPM-based and NSCL-based experiments. Our method outperforms baselines by reducing IM (plasticity, lower is better) while maintaining a good BWT (stability, higher is better).
  • Figure 4: Final test accuracy of each task in GPM-based and NSCL-based experiments. Our method simultaneously enhances the accuracy and balance across different tasks (STD).
  • Figure 5: Accuracy after one epoch (AOA) in GPM-based and NSCL-based experiments. Note that the training process is the same until the second stage of task 2. Our method demonstrates superior generalization when learning new tasks compared to the baselines.
  • ...and 8 more figures
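
The captions for Figures 3 and 4 refer to BWT and to the spread (STD) of final per-task accuracy. The sketch below shows how such numbers are commonly computed from an accuracy matrix. It assumes the widely used definition of backward transfer (final accuracy on each earlier task minus the accuracy measured right after that task was learned, averaged over earlier tasks); the paper's exact formula is not shown in this section, and IM is omitted because it requires reference accuracies that are not given here. The function names and the accuracy values are hypothetical.

```python
import numpy as np


def backward_transfer(acc):
    """Mean change in accuracy on earlier tasks after all tasks are learned.

    Assumption: acc[i, j] is the test accuracy on task j measured after
    training through task i. Negative values indicate forgetting.
    """
    acc = np.asarray(acc, dtype=float)
    num_tasks = acc.shape[0]
    if acc.ndim != 2 or acc.shape[1] != num_tasks or num_tasks < 2:
        raise ValueError("acc must be a square matrix with at least 2 tasks")
    final = acc[-1, :-1]                # accuracy on earlier tasks at the end
    just_learned = np.diag(acc)[:-1]    # accuracy right after learning each task
    return float(np.mean(final - just_learned))


def final_accuracy_stats(acc):
    """Mean and (population) standard deviation of final per-task accuracy."""
    last_row = np.asarray(acc, dtype=float)[-1]
    return float(np.mean(last_row)), float(np.std(last_row))


# Made-up accuracies for three tasks, for illustration only.
acc = [
    [0.90, 0.00, 0.00],
    [0.85, 0.80, 0.00],
    [0.80, 0.78, 0.70],
]
bwt = backward_transfer(acc)
mean_acc, std_acc = final_accuracy_stats(acc)
```

A BWT closer to zero means less forgetting, and a smaller standard deviation means accuracy is more evenly balanced across tasks.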

Theorems & Definitions (3)

  • Lemma 3.1
  • proof
  • proof