MagMax: Leveraging Model Merging for Seamless Continual Learning

Daniel Marczak, Bartłomiej Twardowski, Tomasz Trzciński, Sebastian Cygert

TL;DR

The paper tackles continual learning with large pre-trained models, addressing catastrophic forgetting through model merging. It introduces MagMax, which sequentially fine-tunes on new tasks and then consolidates knowledge by selecting, for each parameter, the update with the largest magnitude across the task vectors: $\theta_{MagMax} = \theta_0 + \lambda \tau_{MagMax}$, where $\tau_{MagMax}^p = \tau_k^p$ with $k = \arg\max_i |\tau_i^p|$. Here $\tau_i$ is the task vector of task $i$ (the fine-tuned weights minus the pre-trained weights $\theta_0$), $p$ indexes parameters, and $\lambda$ is a scaling factor. Across class- and domain-incremental settings, MagMax achieves state-of-the-art results on multiple benchmarks, and the evaluation reveals that simple baselines (e.g., weight averaging or random weight selection) are unexpectedly strong in certain regimes. The study also analyzes the role of update magnitude, sign consistency, and the contributions of individual task vectors, and shows that sequential fine-tuning improves other merging methods and that results are largely robust to a fixed scaling factor $\lambda$. Overall, MagMax offers a practical, memory-efficient route to continual learning for large pre-trained models, with implications for open-vocabulary and cross-domain adaptation.
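The selection rule above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each model is a flat list of floats, the names `magmax_merge` and `apply_task_vector` are hypothetical, and the value `lam=0.5` is chosen only for illustration (the summary does not state the value used).

```python
def task_vector(theta_i, theta_0):
    """tau_i = theta_i - theta_0, computed per parameter."""
    return [w - w0 for w, w0 in zip(theta_i, theta_0)]


def magmax_merge(task_vectors):
    """For each parameter p, keep the entry tau_k^p with the largest |tau_i^p|.

    On ties, max() returns the first candidate, i.e. the earliest task.
    """
    return [max(entries, key=abs) for entries in zip(*task_vectors)]


def apply_task_vector(theta_0, tau, lam):
    """theta = theta_0 + lam * tau, computed per parameter."""
    return [w0 + lam * t for w0, t in zip(theta_0, tau)]


# Toy example: a pre-trained model and two fine-tuned checkpoints.
theta_0 = [1.0, 1.0, 1.0, 1.0]
theta_1 = [1.25, 0.5, 1.0, 1.5]     # checkpoint after task 1
theta_2 = [0.5, 1.25, 1.125, 1.5]   # checkpoint after task 2

taus = [task_vector(t, theta_0) for t in (theta_1, theta_2)]
tau_magmax = magmax_merge(taus)
theta_magmax = apply_task_vector(theta_0, tau_magmax, lam=0.5)  # lam is an assumed value

print(tau_magmax)
print(theta_magmax)
```

Because the rule only compares magnitudes element by element, the merged vector can also be built incrementally by merging each new task vector into the running result, without keeping every earlier task vector.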

Abstract

This paper introduces a continual learning approach named MagMax, which utilizes model merging to enable large pre-trained models to continuously learn from new data without forgetting previously acquired knowledge. Distinct from traditional continual learning methods that aim to reduce forgetting during task training, MagMax combines sequential fine-tuning with a maximum magnitude weight selection for effective knowledge integration across tasks. Our initial contribution is an extensive examination of model merging techniques, revealing that simple approaches like weight averaging and random weight selection surprisingly hold up well in various continual learning contexts. More importantly, we present MagMax, a novel model-merging strategy that enables continual learning of large pre-trained models for successive tasks. Our thorough evaluation demonstrates the superiority of MagMax in various scenarios, including class- and domain-incremental learning settings. The code is available at this URL: https://github.com/danielm1405/magmax.

Paper Structure (33 sections, 14 figures, 8 tables, 1 algorithm)

Figures (14)

  • Figure 1: Overview of the proposed MagMax method for continual learning. We sequentially fine-tune the model on the subsequent tasks and create task vectors $\tau_i$ by subtracting the weights of the pre-trained model $\theta_0$. Then we merge the task vectors using the Maximum Magnitude Selection strategy, which selects, for each parameter, the task-vector value with the highest magnitude. Finally, we apply the merged task vector to the pre-trained model to obtain a multitask model $\theta_{MagMax}$. Note that with the running-statistics implementation we need to store only two sets of weights (see the section on memory footprint for details).
  • Figure 2: Only a small fraction of parameters that changed the most during fine-tuning is responsible for improved performance.
  • Figure 2: MagMax outperforms other merging-based methods in domain-incremental scenarios and achieves similar results to CL methods. We report task-agnostic accuracy (%) after the final task. The best results are in bold and the second best underlined.
  • Figure 3: Sequential fine-tuning encourages consistent directions of parameter updates. We report sign conflicts after trimming 80% of the lowest magnitude parameters in each task vector.
  • Figure 4: Sequential fine-tuning (left) exhibits high forgetting. Merging independently fine-tuned models significantly reduces forgetting (middle). MagMax reduces it further (right). Results on already learned tasks are shown in orange and zero-shot performance in blue. We report task-agnostic accuracy (%) for each task (columns) after training on the subsequent tasks (rows). The last column is the average accuracy on already seen tasks (lower triangular matrix in orange).
  • ...and 9 more figures
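
The sign-conflict measurement mentioned in the Figure 3 caption can be sketched as follows. This is only an illustration under stated assumptions: task vectors are flat lists of floats, a parameter counts as conflicting when the trimmed task vectors contain both a positive and a negative value at that position, and the rate is normalized by the total number of parameters. The function names and the normalization are hypothetical; the paper's exact definition may differ.

```python
def trim_lowest(tau, drop_fraction=0.8):
    """Zero out the drop_fraction of entries with the smallest magnitude."""
    n_drop = round(len(tau) * drop_fraction)
    order = sorted(range(len(tau)), key=lambda i: abs(tau[i]))
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else v for i, v in enumerate(tau)]


def sign_conflict_rate(task_vectors, drop_fraction=0.8):
    """Fraction of parameters whose surviving updates disagree in sign.

    Assumption: the denominator is the total number of parameters.
    """
    trimmed = [trim_lowest(t, drop_fraction) for t in task_vectors]
    conflicts = 0
    for entries in zip(*trimmed):
        has_pos = any(v > 0 for v in entries)
        has_neg = any(v < 0 for v in entries)
        if has_pos and has_neg:
            conflicts += 1
    return conflicts / len(trimmed[0])


# Toy task vectors: after trimming, only the two largest entries of each survive.
tau_a = [0.9, -0.8, 0.01, 0.02, -0.01, 0.03, 0.0, 0.01, -0.02, 0.01]
tau_b = [-0.7, -0.6, 0.01, 0.0, 0.02, -0.01, 0.01, 0.0, 0.01, -0.02]

rate = sign_conflict_rate([tau_a, tau_b], drop_fraction=0.8)
print(rate)
```

In this toy case the first parameter is updated in opposite directions by the two tasks, while the second is decreased by both, so only the first counts as a conflict.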