What Matters for Model Merging at Scale?

Prateek Yadav; Tu Vu; Jonathan Lai; Alexandra Chronopoulou; Manaal Faruqui; Mohit Bansal; Tsendsuren Munkhdalai

TL;DR

This work studies how model merging behaves as model size, base model quality, and the number of experts increase. It systematically evaluates four merging methods on PaLM-2 bases (including an instruction-tuned variant) across 1B–64B parameters and up to 8 experts, using held-in and held-out tasks from the T0 mixture. The results show that instruction-tuned bases and larger models substantially improve merge performance and zero-shot generalization, that merged models sometimes outperform multitask baselines when many large experts are combined, and that simple averaging remains effective at scale. These findings provide practical guidelines for scalable, decentralized model merging and set a benchmark for future large-scale merging research.

Abstract

Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interacts with other key factors, such as the base model quality and the number of expert models, to affect the merged model's performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using 4 popular merging methods (Averaging, Task Arithmetic, Dare, and TIES) across model sizes ranging from 1B to 64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the experts' training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third, merging consistently improves generalization capabilities. Notably, when merging 8 large expert models, the merged models often generalize better than the multitask trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.
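
To make two of the merging methods named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each model is a dictionary mapping parameter names to NumPy arrays, and it uses the commonly cited formulations: Averaging takes the parameter-wise mean of the experts, and Task Arithmetic adds a scaled sum of task vectors (expert minus base) to the base model. The function names and the `scale` parameter are hypothetical, chosen only for illustration.

```python
import numpy as np

def average_merge(experts):
    """Parameter-wise mean of the expert models (dicts of name -> array)."""
    names = experts[0].keys()
    return {n: np.mean([e[n] for e in experts], axis=0) for n in names}

def task_arithmetic_merge(base, experts, scale=1.0):
    """Add the scaled sum of task vectors (expert - base) to the base model.

    `scale` is an assumed hyperparameter name, not taken from the paper.
    """
    merged = {}
    for name, base_w in base.items():
        task_vectors = [e[name] - base_w for e in experts]
        merged[name] = base_w + scale * np.sum(task_vectors, axis=0)
    return merged

# Toy example: one parameter tensor, two experts fine-tuned from one base.
base = {"w": np.array([1.0, 1.0, 1.0])}
experts = [
    {"w": np.array([2.0, 1.0, 3.0])},
    {"w": np.array([4.0, 5.0, 1.0])},
]

avg = average_merge(experts)
ta = task_arithmetic_merge(base, experts, scale=1.0 / len(experts))
```

In this sketch, setting `scale` to one over the number of experts makes Task Arithmetic coincide with Averaging, which is one way to see how closely the simplest methods are related.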

Paper Structure (40 sections, 7 figures, 4 tables)


Introduction
Background
Model Merging Methods
Averaging
Task Arithmetic
TIES Merging
Dare Merging
Challenges/Limitations
Most Studies on Small Models (<7B parameters):
Model Merging Studies with Large Models are Shallow:
Varied Evaluation Setups:
Large Scale Evaluation of Model Merging
Data:
Expert Model Creation:
...and 25 more sections

Figures (7)

Figure 1: Held-in performance results from our large-scale model merging experiments, conducted over key factors such as base models, model sizes, merging methods, and the number of experts being merged. We present results for two base models, PaLM-2 and an instruction-tuned version of it, PaLM-2-IT; four model sizes (1B, 8B, 24B, 64B); and four merging methods (Averaging, Task Arithmetic, Dare-TIES, and TIES-Merging), when merging either 2 or 8 expert models. We report performance normalized by the oracle expert's performance, which is denoted by the bold black circle of radius 1. We also present the performance of the multitask baseline trained on the held-in tasks. We find that merging expert models created from the instruction-tuned PaLM-2-IT model always performs better than merging PaLM-2-based experts. Moreover, the gap between these models increases when we merge more experts. Larger experts (64B) merge better and show the best held-in performance.
Figure 2: Merged experts created from big and strong base models generalize better than multitask models. We find that for strong base models, as we merge more experts (x-axis, $\rightarrow$), the merged model's generalization performance (y-axis, $\uparrow$) monotonically increases to approach and eventually surpass the multitask baseline (yellow line). More details in Section \ref{sec:generalize_better}.
Figure 3: Instruction-tuned models facilitate easier merging. PaLM-2-IT (green points) consistently outperforms PaLM-2 (red points), as shown by the large gap between the green and red points across various merging methods, model sizes, and numbers of constituent models. This indicates that stronger instruction-tuned base models enhance the performance of merged models. The dashed lines denote the performance of the experts trained on the held-in tasks as defined in § \ref{sec:experimental_setup}. For more details see Section \ref{sec:better_base_is_better}.
Figure 4: Bigger models merge better. On held-in evaluations, we find that bigger models always perform better than smaller models, barring a few outliers. We find that large instruction-tuned models like the 64B PaLM-2-IT are the easiest to merge. For more details see Section \ref{sec:bigger_is_easier}.
Figure 5: Merged models at scale generalize better. We plot the held-out generalization of the merged model for two merging methods. We also include the performance of the base model (dashed line) and the multitask baseline (yellow line), which is trained on a mixture of held-in tasks. We find that the number of constituent expert models (x-axis, $\rightarrow$) had little effect on zero-shot generalization, as shown in the left and center plots. However, increasing model size significantly to 64B improved the merged model's performance over the base model (right plot). For more details see Section \ref{sec:generalize_better}.
...and 2 more figures
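
The Figure 1 caption reports performance normalized by the oracle expert's performance, so a value of 1 means the merged model matches the task-specific expert. A minimal sketch of one way to compute such a score is given below. The per-task ratio followed by a simple mean is an assumption made for illustration, since the caption does not state the exact aggregation; the function name, task names, and scores are hypothetical.

```python
def normalized_performance(merged_scores, expert_scores):
    """Mean over tasks of merged score divided by the oracle expert's score.

    Assumption: per-task ratios are averaged with equal weight.
    """
    ratios = [merged_scores[task] / expert_scores[task] for task in expert_scores]
    return sum(ratios) / len(ratios)

# Made-up scores for two held-in tasks (illustrative only, not paper results).
merged = {"task_a": 0.60, "task_b": 0.45}
expert = {"task_a": 0.80, "task_b": 0.50}

score = normalized_performance(merged, expert)
```

Under this convention, a score below 1 indicates that merging lost some held-in performance relative to the individual experts.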
