The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs

Lucas Bandarkar, Nanyun Peng

TL;DR

This work probes how to transfer math reasoning and multilingual capabilities to low-resource languages in dense LLMs by inducing modularity during fine-tuning and through model merging. It demonstrates that math and language abilities can be allocated to distinct parameter subsets, and that reassembling these via Layer-Swapping often yields stronger cross-lingual performance than joint multi-task fine-tuning. The study provides both empirical evidence and theoretical intuition, rooted in the linearity of task vectors and the effectiveness of training followed by resets, to explain why Layer-Swapping works well and why train-then-revert often beats freeze-then-train. The findings offer practical guidance for efficient, scalable cross-lingual adaptation in data-constrained settings and motivate broader exploration of explicit modular architectures and interpretability in LLM parameterization.

Abstract

Large language models (LLMs) still struggle across tasks outside of high-resource languages. In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce. Building on prior work, we first validate that the subsets of model parameters that matter most for mathematical reasoning and multilingual capabilities are distinctly non-overlapping. To exploit this implicit separability between task and target language parameterization, we develop and analyze numerous modular frameworks to improve the composition of the two during fine-tuning. These methods generally employ freezing parameters or post hoc model merging to assign math and language improvement to different key parts of the LLM. In the absence of in-language math data, we demonstrate that the modular approaches successfully improve upon baselines across three languages, four models, and two fine-tuning paradigms (full and LoRA). Furthermore, we identify the most consistently successful modular method to be fine-tuning separate language and math experts and model merging via Layer-Swapping, somewhat surprisingly. We offer possible explanations for this result via recent works on the linearity of task vectors. We further explain this by empirically showing that reverting less useful fine-tuning updates after training often outperforms freezing them from the start.
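
As a rough illustration of the two ideas named in the abstract, the sketch below shows Layer-Swapping (assembling a model from a math expert and a language expert) and train-then-revert (resetting selected parameters to their pre-fine-tuning values). It is a minimal toy under stated assumptions, not the authors' implementation: parameters are plain Python lists rather than tensors, the `layers.<i>.` naming scheme is hypothetical, and the choice of which layer indices come from the language expert is an arbitrary placeholder, since the text above does not specify it.

```python
from typing import Dict, List, Set

# Toy stand-in for a checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def layer_index(param_name: str) -> int:
    """Return the layer index in a name like 'layers.3.mlp.w', or -1 if none.

    The 'layers.<i>.' naming scheme is an assumption made for this sketch.
    """
    parts = param_name.split(".")
    for i, part in enumerate(parts[:-1]):
        if part == "layers" and parts[i + 1].isdigit():
            return int(parts[i + 1])
    return -1


def layer_swap(math_expert: StateDict,
               language_expert: StateDict,
               language_layers: Set[int]) -> StateDict:
    """Build a merged checkpoint: layers listed in `language_layers` are
    copied from the language expert, everything else from the math expert."""
    merged: StateDict = {}
    for name in math_expert:
        use_language = layer_index(name) in language_layers
        source = language_expert if use_language else math_expert
        merged[name] = list(source[name])  # copy so inputs stay untouched
    return merged


def revert_updates(base: StateDict,
                   fine_tuned: StateDict,
                   revert_layers: Set[int]) -> StateDict:
    """Train-then-revert: keep the fine-tuned weights except in
    `revert_layers`, which are reset to their values in the base model."""
    reverted: StateDict = {}
    for name in fine_tuned:
        source = base if layer_index(name) in revert_layers else fine_tuned
        reverted[name] = list(source[name])
    return reverted


# Hypothetical three-layer model; the numbers are placeholders, not results.
base = {"layers.0.w": [0.0], "layers.1.w": [0.0], "layers.2.w": [0.0]}
math_expert = {"layers.0.w": [1.0], "layers.1.w": [1.0], "layers.2.w": [1.0]}
language_expert = {"layers.0.w": [2.0], "layers.1.w": [2.0], "layers.2.w": [2.0]}

# Outer layers from the language expert, middle layer from the math expert.
merged = layer_swap(math_expert, language_expert, language_layers={0, 2})

# Fine-tune everything for math, then reset the outer layers to the base.
reverted = revert_updates(base, math_expert, revert_layers={0, 2})
```

With real checkpoints the same selection logic would apply to framework state dictionaries. The contrast with freeze-then-train is that here every parameter is updated during fine-tuning and the less useful updates are discarded afterwards, rather than being held fixed from the start.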

Paper Structure

This paper contains 35 sections, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Illustration of the three methods that induce modularity by imposing target language capabilities (brown) and mathematical reasoning (blue) on separate LLM parameters. Method [1] is from Bandarkar et al. (2025).
  • Figure 2: Per-language breakdown of the average performance gain from each type of training, averaged across four models. While math-only SFT (green) does well for Swahili and mixed-data SFT (red) does well for Bengali, our two modular solutions work consistently well across the three languages. Note: the y-axis is a percentage because the evaluation score is accuracy, not because this figure displays percent change.