Mitigating Catastrophic Forgetting in Language Transfer via Model Merging

Anton Alexandrov; Veselin Raychev; Mark Niklas Müller; Ce Zhang; Martin Vechev; Kristina Toutanova

TL;DR

This paper tackles catastrophic forgetting during language transfer by introducing Branch-and-Merge (BaM), a continual-learning-inspired method that splits the training data into slices, finetunes multiple copies of the current base model in parallel on different slices, and merges them to form the base for the next iteration. BaM reduces the magnitude of weight changes while producing higher-quality updates, thereby preserving base-model capabilities while still learning the new language. The authors pair BaM with a carefully designed data mix, including approximate experience replay and targeted Bulgarian and German data, and evaluate on Bulgarian and German benchmarks across continued pretraining (CPT) and instruction finetuning (IFT), using Mistral-7B and Llama-3-8B as base models. Results show that BaM consistently reduces forgetting and maintains or improves target-language performance relative to standard CPT and IFT, with ablations highlighting the importance of replay data quality, merge strategy, and the balance between learning and forgetting. The work demonstrates practical gains for multilingual adaptation and points to broader applicability in continual learning, while acknowledging its limited scope and the need for broader evaluation.

Abstract

As open-weight large language models (LLMs) achieve ever more impressive performances across a wide range of tasks in English, practitioners aim to adapt these models to different languages. However, such language adaptation is often accompanied by catastrophic forgetting of the base model's capabilities, severely limiting the usefulness of the resulting model. We address this issue by proposing Branch-and-Merge (BaM), a new adaptation method based on iteratively merging multiple models, fine-tuned on a subset of the available training data. BaM is based on the insight that this yields lower magnitude but higher quality weight changes, reducing forgetting of the source domain while maintaining learning on the target domain. We demonstrate in an extensive empirical study on Bulgarian and German that BaM can significantly reduce forgetting while matching or even improving target domain performance compared to both standard continued pretraining and instruction finetuning across different model architectures.
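The iterative procedure described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: a model is reduced to a flat list of floats, `toy_finetune` is a hypothetical stand-in for real training, and the merge is a uniform element-wise average, which is only one possible merging strategy.

```python
from typing import Callable, List, Sequence

Params = List[float]  # toy stand-in for a model's flattened weights


def merge_uniform(models: Sequence[Params]) -> Params:
    """Average parameters element-wise (uniform linear merge; an assumption)."""
    n = len(models)
    return [sum(vals) / n for vals in zip(*models)]


def branch_and_merge(
    base: Params,
    slices: Sequence[object],
    finetune: Callable[[Params, object], Params],
    branches: int = 2,
) -> Params:
    """Consume data slices `branches` at a time: finetune each branch from the
    current base, then merge the branches into the base for the next round."""
    if len(slices) % branches != 0:
        raise ValueError("number of slices must be divisible by `branches`")
    current = list(base)
    for start in range(0, len(slices), branches):
        group = slices[start:start + branches]
        # Every branch starts from the same base but sees a different slice.
        branch_models = [finetune(current, data) for data in group]
        current = merge_uniform(branch_models)
    return current


def toy_finetune(params: Params, target: float) -> Params:
    """Hypothetical 'training': move every weight halfway toward `target`."""
    return [p + 0.5 * (target - p) for p in params]


# Four toy "slices" (here just target values), two branches per iteration.
merged = branch_and_merge([0.0, 0.0], [1.0, 3.0, 2.0, 2.0], toy_finetune)
print(merged)
```

In a real setting, `finetune` would run continued pretraining or instruction finetuning on one data slice, and the merge would operate on full model checkpoints.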

Paper Structure (74 sections, 1 equation, 7 figures, 17 tables, 1 algorithm)

Introduction
Catastrophic Forgetting
Experience Replay
This Work: Mitigating Catastrophic Forgetting with Branch-and-Merge
Results
Key Contributions
Model Merging
Branch-and-Merge for Mitigating Forgetting in Language Transfer
Intuition
Implementation
Data Mixtures for Mitigating Forgetting in Language Transfer
Approximate Experience Replay of Source Domain Data
Minimal Experience Replay of Source Domain Data
Constructing Target Language Data
Bulgarian Training Data
...and 59 more sections

Figures (7)

Figure 1: Illustration of Branch-and-Merge (BaM). We first split the training data into $N$ slices (blue). We then iteratively finetune the current base model on two of these slices (green) and merge the resulting models to obtain the base model for the next iteration (purple). We repeat this until all $N$ data slices have been used.
Figure 2: Illustration of BaM in the loss surface over parameter space. Both $\theta_1$ and $\theta_2$ land in poor local minima but their merge $\theta'_{1,2}$ lies in the valley of a better minimum. Training from there, $\theta_3$ and $\theta_4$ land at the boundary of that minimum due to noise in the training process and limited data. Their merge $\theta'_{3,4}$ cancels these errors and lies in the better minimum.
Figure 3: Comparing minimal and our approximate experience replay on Mistral-7B with respect to average Bulgarian benchmark scores ($\uparrow$) and Negative Log-Likelihood (NLL) on the English validation set ($\downarrow$).
Figure 4: Average Bulgarian benchmark score ($\uparrow$) and English NLL ($\downarrow$) over L2 norm of weight change depending on training method for Llama-3-8B.
Figure 5: L2 norm of weight change depending on training method for Llama-3-8B.
...and 2 more figures
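The quantity on the x-axis of Figures 4 and 5, the L2 norm of weight change, can be illustrated with a small sketch. It assumes the norm is the Euclidean norm of the difference between adapted and base weights taken over all parameters; the captions do not spell out the exact definition. The parameter dictionaries and their values are made up for illustration.

```python
import math
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values (toy)


def weight_change_l2(base: Weights, tuned: Weights) -> float:
    """Euclidean norm of (tuned - base) over all parameters (assumed definition)."""
    total = 0.0
    for name, base_vals in base.items():
        tuned_vals = tuned[name]
        if len(tuned_vals) != len(base_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        total += sum((t - b) ** 2 for b, t in zip(base_vals, tuned_vals))
    return math.sqrt(total)


base = {"w": [0.0, 0.0]}
branch_a = {"w": [3.0, 0.0]}  # hypothetical model finetuned on slice A
branch_b = {"w": [0.0, 4.0]}  # hypothetical model finetuned on slice B
# Uniform average of the two branches (one possible merge; an assumption).
merged = {"w": [(a + b) / 2 for a, b in zip(branch_a["w"], branch_b["w"])]}

norm_a = weight_change_l2(base, branch_a)
norm_b = weight_change_l2(base, branch_b)
norm_merged = weight_change_l2(base, merged)
print(norm_a, norm_b, norm_merged)
```

In this toy case the two branches move in different directions, so their average lies closer to the base than either branch does. This matches the intuition that merging yields lower-magnitude weight changes, though the example is not evidence for the paper's results.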
