Mitigating Catastrophic Forgetting in Language Transfer via Model Merging

Anton Alexandrov, Veselin Raychev, Mark Niklas Müller, Ce Zhang, Martin Vechev, Kristina Toutanova

TL;DR

This paper tackles catastrophic forgetting during language transfer by introducing Branch-and-Merge (BaM), a continual-learning-inspired method that splits training data into slices, trains multiple models in parallel on subsets, and merges them to form a new base for subsequent iterations. BaM reduces weight-change magnitude and produces higher-quality updates, thereby preserving base capabilities while enabling learning in a new language. The authors augment BaM with a carefully designed data mix, including approximate experience replay and targeted Bulgarian and German data, and evaluate on Bulgarian and German benchmarks across continued pretraining (CPT) and instruction finetuning (IFT), using Mistral-7B and Llama-3-8B as bases. Results show BaM consistently reduces forgetting and maintains or improves target-language performance relative to standard CPT and IFT, with ablations highlighting the importance of data replay quality, merge strategies, and the balance of learning and forgetting. The work demonstrates practical gains for multilingual adaptation and points to broader applicability in continual learning scenarios, while acknowledging limitations in scope and the need for broader evaluation.

Abstract

As open-weight large language models (LLMs) achieve ever more impressive performances across a wide range of tasks in English, practitioners aim to adapt these models to different languages. However, such language adaptation is often accompanied by catastrophic forgetting of the base model's capabilities, severely limiting the usefulness of the resulting model. We address this issue by proposing Branch-and-Merge (BaM), a new adaptation method based on iteratively merging multiple models, fine-tuned on a subset of the available training data. BaM is based on the insight that this yields lower magnitude but higher quality weight changes, reducing forgetting of the source domain while maintaining learning on the target domain. We demonstrate in an extensive empirical study on Bulgarian and German that BaM can significantly reduce forgetting while matching or even improving target domain performance compared to both standard continued pretraining and instruction finetuning across different model architectures.
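
The iterative procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: models are reduced to flat lists of floats, `toy_finetune` is a hypothetical stand-in for a real training run, and uniform averaging in `merge_linear` is an assumed merge rule chosen only for simplicity (the summary mentions that several merge strategies are compared but does not specify them here).

```python
from typing import Callable, List, Sequence

Weights = List[float]


def merge_linear(models: Sequence[Weights]) -> Weights:
    """Uniformly average parameter vectors (an assumed, simple merge rule)."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def toy_finetune(weights: Weights, data_slice: Sequence[float], lr: float = 0.5) -> Weights:
    """Hypothetical stand-in for training: pull each weight toward the slice mean."""
    target = sum(data_slice) / len(data_slice)
    return [w + lr * (target - w) for w in weights]


def branch_and_merge(
    base: Weights,
    slices: Sequence[Sequence[float]],
    finetune: Callable[[Weights, Sequence[float]], Weights] = toy_finetune,
    branches: int = 2,
    merge: Callable[[Sequence[Weights]], Weights] = merge_linear,
) -> Weights:
    """Iterate over data slices in groups, branching and merging at each step."""
    current = list(base)
    for start in range(0, len(slices), branches):
        group = slices[start:start + branches]
        # Branch: finetune independent copies of the current base, one per slice.
        branch_models = [finetune(current, s) for s in group]
        # Merge: the merged model becomes the base for the next iteration.
        current = merge(branch_models)
    return current


# Four slices, two branches per iteration, so two branch-and-merge rounds.
result = branch_and_merge([0.0, 0.0], [[1.0], [3.0], [2.0], [2.0]])
print(result)  # [1.5, 1.5]
```

In this toy setting each merged model moves less far from its base than the most aggressive branch would on its own, which loosely mirrors the "lower magnitude" weight changes the abstract refers to.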

Paper Structure (74 sections, 1 equation, 7 figures, 17 tables, 1 algorithm)


Figures (7)

  • Figure 1: Illustration of Branch-and-Merge (BaM). We first split the training data into $N$ slices (blue). We then iteratively finetune the current base model on two of these slices (green) and merge the resulting models to obtain the base model for the next iteration (purple). We repeat this until all $N$ data slices have been used.
  • Figure 2: Illustration of BaM in the loss surface over parameter space. Both $\theta_1$ and $\theta_2$ land in poor local minima but their merge $\theta'_{1,2}$ lies in the valley of a better minimum. Training from there, $\theta_3$ and $\theta_4$ land at the boundary of that minimum due to noise in the training process and limited data. Their merge $\theta'_{3,4}$ cancels these errors and lies in the better minimum.
  • Figure 3: Comparing minimal and our approximate experience replay on Mistral-7B with respect to average Bulgarian benchmark scores ($\uparrow$) and Negative Log-Likelihood (NLL) on the English validation set ($\downarrow$).
  • Figure 4: Average Bulgarian benchmark score ($\uparrow$) and English NLL ($\downarrow$) over L2 norm of weight change depending on training method for Llama3-3-
  • Figure 5: L2 norm of weight change depending on training method for Llama3-3-
  • ...and 2 more figures
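
Several captions above refer to the L2 norm of the weight change between a base model and its adapted version. As a rough illustration of that quantity, the sketch below computes it over all parameters jointly. The dictionary-of-lists representation and the name `l2_weight_change` are assumptions made for this example and are not taken from the paper.

```python
import math
from typing import Dict, List

Params = Dict[str, List[float]]


def l2_weight_change(base: Params, adapted: Params) -> float:
    """L2 norm of (adapted - base), taken over all parameters jointly."""
    squared = 0.0
    for name, base_values in base.items():
        for b, a in zip(base_values, adapted[name]):
            squared += (a - b) ** 2
    return math.sqrt(squared)


# Toy parameters: the difference vector is (3, 0, 4), so the norm is 5.
base_params = {"w": [0.0, 0.0], "b": [0.0]}
adapted_params = {"w": [3.0, 0.0], "b": [4.0]}
print(l2_weight_change(base_params, adapted_params))  # 5.0
```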