Soup to go: mitigating forgetting during continual learning with model averaging

Anat Kleiman, Gintare Karolina Dziugaite, Jonathan Frankle, Sham Kakade, Mansheej Paul

TL;DR

Catastrophic forgetting in sequential task fine-tuning is mitigated by Sequential Fine-tuning Averaging (SFA), a data-free model-merging approach that periodically averages the current training state with an earlier checkpoint. SFA formalizes parameter updates as θ_{t+1} = (1−β) θ_t^* + β θ_o after every interval of pT steps and at the end, linking its behavior to L2-regularization and Bayesian interpretations. Across both image and language tasks, SFA matches or surpasses data-buffer baselines and outperforms other merging methods, with stronger past-task retention when averaging occurs during training (p<1). The approach reduces memory and computation compared to rehearsal while providing insights into the role of continual averaging in learning dynamics, and shows applicability to diverse domains including Law, Math, and Code in LLM fine-tuning contexts. Overall, SFA offers a practical, scalable alternative for continual learning with strong empirical performance and solid theoretical intuition.
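
One way to see the link to L2-regularization mentioned above is the following sketch. It assumes plain gradient descent with step size $\eta$ on a task loss $L$, plus a penalty $\frac{\lambda}{2}\lVert\theta - \theta_o\rVert^2$ toward the earlier checkpoint $\theta_o$; the symbols $\eta$, $\lambda$, and $L$ are introduced here for illustration and are not taken from the paper.

$$
\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) - \eta\lambda\,(\theta_t - \theta_o) = (1 - \eta\lambda)\,\theta_t + \eta\lambda\,\theta_o - \eta \nabla L(\theta_t)
$$

Under these assumptions, each penalized step is an interpolation toward $\theta_o$ with weight $\eta\lambda$ followed by an ordinary gradient step. The SFA update has the same interpolation form with weight $\beta$, but applies it only every $pT$ steps and at the end rather than at every gradient step.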

Abstract

In continual learning, where task data arrives in a sequence, fine-tuning on later tasks will often lead to performance degradation on earlier tasks. This is especially pronounced when these tasks come from diverse domains. In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the model has learned with minimal computational expenses? Inspired by other merging methods, and L2-regression, we propose Sequential Fine-tuning with Averaging (SFA), a method that merges currently training models with earlier checkpoints during the course of training. SOTA approaches typically maintain a data buffer of past tasks or impose a penalty at each gradient step. In contrast, our method achieves comparable results without the need to store past data, or multiple copies of parameters for each gradient step. Furthermore, our method outperforms common merging techniques such as Task Arithmetic, TIES Merging, and WiSE-FT, as well as other penalty methods like L2 and Elastic Weight Consolidation. In turn, our method offers insight into the benefits of merging partially-trained models during training across both image and language domains.
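
The averaging rule from the TL;DR can be illustrated with a small NumPy sketch on a toy problem. This is not the authors' implementation: the function `sfa_finetune`, the quadratic toy losses, and the choice to keep `theta_o` fixed at the checkpoint from the start of the current task are all assumptions made for illustration.

```python
import numpy as np

def sfa_finetune(theta_o, grad_fn, T, p, beta, lr=0.1):
    """Run T gradient steps from theta_o, averaging back toward theta_o
    every p*T steps and at the end (hypothetical helper, not from the paper)."""
    theta = theta_o.copy()
    interval = max(1, int(round(p * T)))  # number of steps between averaging events
    for step in range(1, T + 1):
        theta = theta - lr * grad_fn(theta)  # ordinary fine-tuning step
        if step % interval == 0 or step == T:
            # theta <- (1 - beta) * theta + beta * theta_o
            theta = (1.0 - beta) * theta + beta * theta_o
    return theta

# Toy setup (assumption): each "task" is a quadratic with its own optimum.
task_a_opt = np.array([0.0, 0.0])
task_b_opt = np.array([1.0, 1.0])

def loss_a(theta):
    return 0.5 * np.sum((theta - task_a_opt) ** 2)

def grad_b(theta):
    return theta - task_b_opt

theta_o = task_a_opt.copy()  # checkpoint after "training" on task A

# beta = 0 disables averaging, so this is plain sequential fine-tuning on task B.
plain = sfa_finetune(theta_o, grad_b, T=100, p=1.0, beta=0.0)
# Averaging during training (p < 1) keeps the parameters closer to theta_o.
sfa = sfa_finetune(theta_o, grad_b, T=100, p=0.25, beta=0.5)
```

In this toy setting, plain fine-tuning drifts to the optimum of the second task, while the averaged run ends between the two optima and so retains a lower loss on the first task. How `theta_o` is chosen and updated across a longer task sequence is left open here.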

Paper Structure (22 sections, 15 equations, 12 figures, 8 tables, 1 algorithm)

This paper contains 22 sections, 15 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: A comparison of ViT (base) fine-tuned on a sequence of 20 tasks from Food-101 (left) and CIFAR-100 (right) using various continual learning techniques. Across both datasets, using SFA with varying $p$ results in a high final average accuracy across all tasks (y-axis), comparable to using a data buffer. Furthermore, averaging during training ($p < 1$) achieves higher performance than averaging only once at the end ($p=1$).
  • Figure 2: A comparison of sequentially fine-tuning ViT (base) on 20 tasks (Food-101) with (bottom) and without SFA (top). Each new task is introduced with a different colored curve across gradient timesteps (x-axis) resulting in changes to both current and past task accuracies (y-axis). The use of SFA can be seen to improve cumulative past task performance at averaging steps.
  • Figure 3: A comparison of Llama 2 (7B)'s performance on Math (y-axis) and Law (x-axis) using various fine-tuning and model merging techniques. The results are contained by dashed boundary boxes: the left and bottom lines represent the performance of a pretrained Llama 2 (7B) on Math and Law, whereas the right and top lines represent the performance of Llama 2 (7B) after fine-tuning on Law and Math respectively. A curve shows the performance of SFA with varying $p$, next to comparisons of continual learning with a data buffer, Task Arithmetic, and TIES. Finally, we also show an initial model (fine-tuned on math) and performance after sequentially fine-tuning it on Law.
  • Figure 4: A comparison of Qwen2.5 (1.5B)'s performance on Math and Law using various fine-tuning and model merging techniques, similar to Figure 3. On Math to Law, SFA $p=0.25$ can be seen as having comparable performance to using a data buffer with 5% past task data, while outperforming Task Arithmetic, which resembles fine-tuning with no intervention and WiSE-FT in performance.
  • Figure 5: A comparison of Pythia (2.8B)'s performance on multiple domains (Math and Law; Math and Code) using various fine-tuning and model merging techniques, similar to Figure 3. On Math to Law, SFA $p=0.25$ can be seen as having comparable performance to using a data buffer, while outperforming Task Arithmetic. Likewise, in Math to Code, SFA with varying $p$ outperforms using a data buffer and Task Arithmetic.
  • ...and 7 more figures