Model Merging Scaling Laws in Large Language Models

Yuanyi Wang; Yanggan Gu; Yiming Zhang; Qi Zhou; Zhaoyi Yan; Congkai Xie; Xinyao Wang; Jianbo Yuan; Hongxia Yang

TL;DR

This work introduces a compact floor+tail scaling law that predicts the expected cross-entropy loss when $k$ domain experts are merged into a base language model of size $N$: $\mathbb{E}[L|N,k] = L_\infty(N) + \frac{A(N)}{k+b}$, with floor $L_\infty(N)=L_\ast + B N^{-\beta}$ and tail amplitude $A(N)=A_0 N^{-\gamma}$. The law holds across in-domain and cross-domain settings, nine domains, and four merging methods. It shows that larger base models lower both the floor and the tail, while gains from additional experts saturate early, at $k \approx 5$–$6$. The authors also provide a theoretical justification via a second-order expansion that yields a $1/k$ tail and a bound on the variance, and they demonstrate that the law transfers to other backbones such as LLaMA with high fidelity. Practically, the law enables predictive curve fitting from a few early points, budget-aware decisions on the number of experts to merge, and principled trade-offs between scaling the base model and enlarging the pool of experts, offering a scalable path toward modular, distributed generative AI systems.
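To make the shape of the law concrete, here is a minimal Python sketch that evaluates the floor, the tail, and the fraction of the attainable gain realized after merging $k$ experts. The class and function names are illustrative, and every parameter value below is a hypothetical placeholder rather than a fitted constant from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MergeLawParams:
    """Constants of the floor+tail law (all values used below are hypothetical)."""
    L_star: float  # irreducible loss L_*
    B: float       # floor coefficient
    beta: float    # floor exponent
    A0: float      # tail coefficient
    gamma: float   # tail exponent
    b: float       # offset in the 1/(k+b) tail


def floor_loss(N: float, p: MergeLawParams) -> float:
    """L_inf(N) = L_* + B * N^(-beta)."""
    return p.L_star + p.B * N ** (-p.beta)


def tail_amplitude(N: float, p: MergeLawParams) -> float:
    """A(N) = A0 * N^(-gamma)."""
    return p.A0 * N ** (-p.gamma)


def expected_loss(N: float, k: int, p: MergeLawParams) -> float:
    """E[L | N, k] = L_inf(N) + A(N) / (k + b)."""
    return floor_loss(N, p) + tail_amplitude(N, p) / (k + p.b)


def fraction_of_gain(k: int, b: float) -> float:
    """Share of the total k=1 -> infinity improvement already realized at k.

    (L(1) - L(k)) / (L(1) - L_inf) simplifies to 1 - (1 + b) / (k + b),
    which does not depend on N.
    """
    return 1.0 - (1.0 + b) / (k + b)


if __name__ == "__main__":
    # Placeholder constants chosen only to illustrate the curve's shape.
    params = MergeLawParams(L_star=1.5, B=10.0, beta=0.3, A0=5.0, gamma=0.2, b=0.0)
    for k in (1, 2, 4, 6, 8):
        print(k, expected_loss(1e9, k, params), fraction_of_gain(k, params.b))
```

Under this form, the fraction of attainable gain depends only on $k$ and $b$, which is one way to read the early saturation: the $1/(k+b)$ tail delivers most of its improvement within the first few experts regardless of model size.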

Abstract

We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget. It thereby turns merging from a heuristic practice into a computationally efficient, plannable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
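The planning use case can be sketched in Python as follows, for a single fixed model size: fit the tail $L(k) = L_\infty + A/(k+b)$ to a few early measurements, then invert it to find the smallest $k$ that reaches a target loss. The fitting procedure here (a grid search over $b$ with linear least squares for $L_\infty$ and $A$) is an assumption made for illustration, not the authors' fitting method, and the function names and the grid range are hypothetical.

```python
import math

import numpy as np


def fit_tail(ks, losses, b_grid=None):
    """Fit L(k) = L_inf + A / (k + b) at one fixed model size.

    For each candidate b the model is linear in (L_inf, A), so we solve a
    least-squares problem per grid point and keep the best-fitting b.
    The default grid over b is an arbitrary illustrative choice.
    """
    ks = np.asarray(ks, dtype=float)
    losses = np.asarray(losses, dtype=float)
    if b_grid is None:
        b_grid = np.linspace(0.0, 3.0, 31)

    best = None
    for b in b_grid:
        # Design matrix: intercept column (L_inf) and 1/(k+b) column (A).
        X = np.column_stack([np.ones_like(ks), 1.0 / (ks + b)])
        coef, *_ = np.linalg.lstsq(X, losses, rcond=None)
        sse = float(np.sum((X @ coef - losses) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(coef[0]), float(coef[1]), float(b))

    _, L_inf, A, b = best
    return L_inf, A, b


def experts_needed(target, L_inf, A, b):
    """Smallest integer k >= 1 with L_inf + A / (k + b) <= target.

    Returns None when the target lies at or below the floor, i.e. no number
    of experts reaches it at this model size.
    """
    if target <= L_inf:
        return None
    return max(1, math.ceil(A / (target - L_inf) - b))


if __name__ == "__main__":
    # Synthetic, noise-free points generated from made-up constants.
    ks = [1, 2, 3, 4]
    losses = [2.0 + 0.5 / (k + 1.0) for k in ks]
    L_inf, A, b = fit_tail(ks, losses)
    print(L_inf, A, b, experts_needed(2.08, L_inf, A, b))
```

A `None` result is the signal that adding experts cannot reach the target at the current model size, which is where the trade-off against scaling the base model comes in.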
