Weighted Ensemble Models Are Strong Continual Learners

Imad Eddine Marouf, Subhankar Roy, Enzo Tartaglione, Stéphane Lathuilière

TL;DR

The paper tackles catastrophic forgetting in class-incremental learning (CIL) with pretrained transformers by proposing Continual Model Averaging (CoMA), a simple weight-space ensemble that blends the previous-task model with the current-task fine-tuned model to preserve past knowledge while allowing new learning. It further improves this approach with Continual Fisher-weighted Model Averaging (CoFiMA), which uses the diagonal Fisher information to weight parameter contributions according to their importance to each task, enabling a more stable retention-plasticity balance. Across four benchmarks and both supervised and self-supervised pretraining, CoMA delivers strong gains over state-of-the-art PTM-based CL methods, while CoFiMA sets new records and approaches joint-training performance in several settings. The results demonstrate that a compact, post-hoc weight-averaging strategy can effectively mitigate forgetting in continual learning without extensive data rehearsal or architectural changes, making it practical for large pretrained models.

Abstract

In this work, we study the problem of continual learning (CL) where the goal is to learn a model on a sequence of tasks, such that the data from the previous tasks becomes unavailable while learning on the current task data. CL is essentially a balancing act between being able to learn on the new task (i.e., plasticity) and maintaining the performance on the previously learned concepts (i.e., stability). Intending to address the stability-plasticity trade-off, we propose to perform weight-ensembling of the model parameters of the previous and current tasks. This weighted-ensembled model, which we call Continual Model Averaging (or CoMA), attains high accuracy on the current task by leveraging plasticity, while not deviating too far from the previous weight configuration, ensuring stability. We also propose an improved variant of CoMA, named Continual Fisher-weighted Model Averaging (or CoFiMA), that selectively weighs each parameter in the weights ensemble by leveraging the Fisher information of the weights of the model. Both variants are conceptually simple, easy to implement, and effective in attaining state-of-the-art performance on several standard CL benchmarks. Code is available at: https://github.com/IemProg/CoFiMA.
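The two averaging schemes described in the abstract can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the flat `Params` dictionary, the function names, the convention that `lam` is the weight on the current-task model, and the `eps` fallback are all assumptions made for this example. The linked repository remains the reference for the actual method.

```python
from typing import Dict, List

# Hypothetical parameter store for illustration: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def coma_average(prev: Params, curr: Params, lam: float) -> Params:
    """Plain weight-space interpolation between two models.

    Assumed convention: lam is the weight on the current-task model,
    so lam = 0 keeps the previous model and lam = 1 keeps the new one.
    """
    return {
        name: [(1.0 - lam) * p + lam * c for p, c in zip(prev[name], curr[name])]
        for name in prev
    }


def cofima_average(
    prev: Params,
    curr: Params,
    fisher_prev: Params,
    fisher_curr: Params,
    lam: float,
    eps: float = 1e-12,
) -> Params:
    """Fisher-weighted interpolation, applied independently to each parameter.

    Each parameter is pulled towards the model whose (diagonal) Fisher
    value marks it as more important. Falls back to plain interpolation
    when both Fisher values are (near) zero.
    """
    merged: Params = {}
    for name in prev:
        values = []
        for p, c, fp, fc in zip(
            prev[name], curr[name], fisher_prev[name], fisher_curr[name]
        ):
            w_prev = (1.0 - lam) * fp
            w_curr = lam * fc
            denom = w_prev + w_curr
            if denom < eps:
                values.append((1.0 - lam) * p + lam * c)
            else:
                values.append((w_prev * p + w_curr * c) / denom)
        merged[name] = values
    return merged


# Toy example with two scalar parameters in one tensor named "w".
prev = {"w": [1.0, 2.0]}
curr = {"w": [3.0, 6.0]}
fisher_prev = {"w": [1.0, 3.0]}  # second parameter matters more to the old task
fisher_curr = {"w": [3.0, 1.0]}  # first parameter matters more to the new task

plain = coma_average(prev, curr, lam=0.5)
weighted = cofima_average(prev, curr, fisher_prev, fisher_curr, lam=0.5)
```

In the toy example, the plain average lands midway between the two models for both parameters, whereas the Fisher-weighted average moves the first parameter closer to the current model and the second closer to the previous one.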

Paper Structure (25 sections, 14 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Comparison of existing model averaging techniques with our proposed technique for CL. (a) Averaging the weights of the pre-trained model and the fine-tuned model improves performance on both out-of-distribution data and the target dataset. (b) Model soups combine multiple fine-tuned models into a single robust model. (c) In the proposed CoFiMA, the weights of the current and past models are weighted by their Fisher information matrices (represented by $\mathcal{F}$), resulting in balanced performance on both the current and old tasks.
  • Figure 2: Illustration of the parameter trajectory with model averaging in the loss landscape. (a) Models are re-initialized at the start of each task, leading to disparate solutions. (b) Uniform averaging without re-initialization, using a decreasing $\lambda = 1/t$: the model parameters converge towards a solution that balances all tasks. (c) Recent tasks are given more weight, resulting in a solution that remains close to the latest task's model while still accounting for previous tasks.
  • Figure 3: CoFiMA with various PTMs on CIFAR-100. CoFiMA enhances the results of SLCA.
  • Figure 4: Ablation study on the effect of $\lambda$ on the CIFAR-100 dataset. The red marker indicates the best performance.
  • Figure 5: Similarity scores based on the $L_2$ norm for the CIFAR-100 (a) and ImageNet-R (b) datasets, comparing CoFiMA and SLCA.
  • ...and 2 more figures
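
The link between $\lambda = 1/t$ and uniform averaging in the Figure 2 caption can be checked with a small sketch. It assumes that $\lambda$ is the weight given to the newest model at task $t$ (the caption does not spell out the convention) and that models are not re-initialized between tasks. The function name and the toy parameter vectors are hypothetical.

```python
from typing import Callable, List


def running_average(
    models: List[List[float]],
    lam_schedule: Callable[[int], float],
) -> List[float]:
    """Fold a sequence of per-task parameter vectors into one averaged vector.

    At task t (1-indexed, t >= 2) the update is
        avg <- (1 - lam) * avg + lam * theta_t,  with lam = lam_schedule(t),
    where lam is assumed to be the weight on the newest model.
    """
    avg = list(models[0])
    for t, theta in enumerate(models[1:], start=2):
        lam = lam_schedule(t)
        avg = [(1.0 - lam) * a + lam * x for a, x in zip(avg, theta)]
    return avg


# Three toy "task models", each with two parameters.
models = [[0.0, 3.0], [3.0, 3.0], [6.0, 0.0]]

# Decreasing lam = 1/t reproduces the uniform mean of all task models.
uniform = running_average(models, lambda t: 1.0 / t)
arithmetic_mean = [sum(col) / len(models) for col in zip(*models)]

# A constant lam weights recent tasks more heavily, so the result
# stays closer to the latest model than the uniform mean does.
recent_biased = running_average(models, lambda t: 0.7)
```

With the decreasing schedule, the running average matches the arithmetic mean of all task models up to floating-point error. With a constant weight, the result sits closer to the last model, which corresponds to the behaviour described for panel (c).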