Model Merging Scaling Laws in Large Language Models

Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, Hongxia Yang

TL;DR

This work introduces a compact floor+tail scaling law that predicts the cross-entropy loss when merging multiple domain experts into a base language model: $\mathbb{E}[L|N,k] = L_\infty(N) + \frac{A(N)}{k+b}$, with $L_\infty(N)=L_\ast + B N^{-\beta}$ and $A(N)=A_0 N^{-\gamma}$. The law holds across in-domain and cross-domain settings, nine domains, and four merging methods, and it shows larger base models lower both the floor and the tail, while gains saturate early around $k \approx 5$–$6$. The authors also provide a theoretical justification via a second-order expansion yielding a $1/k$ tail and a bound on the variance, and they demonstrate transfer to backbones such as LLaMA with high fidelity. Practically, the law enables predictive curve fitting from a few early points, budget-aware decisions on the number of experts to merge, and principled trade-offs between expanding the base model and increasing the pool of experts, offering a scalable path toward modular, distributed generative AI systems.
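As a rough illustration of "predictive curve fitting from a few early points", the sketch below fits the floor+tail form $L_\infty + A/(k+b)$ at a single fixed $N$. It is not the authors' fitting procedure: the function names, the grid range for $b$, and the synthetic data are all assumptions made for this example. It relies on the fact that, for a fixed $b$, the law is linear in $1/(k+b)$, so $L_\infty$ and $A$ follow from ordinary least squares.

```python
import numpy as np

def floor_tail(k, l_inf, a, b):
    """Floor+tail law at a fixed model size N: l_inf + a / (k + b)."""
    return l_inf + a / (np.asarray(k, dtype=float) + b)

def fit_floor_tail(k, loss, b_grid=None):
    """Fit (l_inf, a, b) by a grid search over b plus linear least squares.

    For a fixed b the law is linear in x = 1 / (k + b), so l_inf and a
    are the intercept and slope of an ordinary least-squares line.
    """
    k = np.asarray(k, dtype=float)
    loss = np.asarray(loss, dtype=float)
    if b_grid is None:
        b_grid = np.linspace(0.0, 5.0, 501)  # hypothetical search range for b
    best = None
    for b in b_grid:
        x = 1.0 / (k + b)
        design = np.column_stack([np.ones_like(x), x])
        coef, *_ = np.linalg.lstsq(design, loss, rcond=None)
        sse = float(np.sum((design @ coef - loss) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(coef[0]), float(coef[1]), float(b))
    _, l_inf, a, b = best
    return l_inf, a, b

# Synthetic "early points" from made-up parameters (not data from the paper).
k_early = np.array([1, 2, 3, 4])
loss_early = floor_tail(k_early, l_inf=1.50, a=0.40, b=0.50)

# Fit on k = 1..4, then extrapolate the curve to k = 8.
l_inf_hat, a_hat, b_hat = fit_floor_tail(k_early, loss_early)
predicted_k8 = float(floor_tail(8, l_inf_hat, a_hat, b_hat))
print(l_inf_hat, a_hat, b_hat, predicted_k8)
```

With real measurements the fit would be noisy, and three free parameters need at least four distinct values of $k$ to be identifiable; a nonlinear least-squares routine is a reasonable alternative to the grid search used here.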

Abstract

We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget, turning merging from heuristic practice into a computationally efficient, plannable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
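The planning uses listed in the abstract can be sketched directly from the law $L_\infty(N) + A(N)/(k+b)$ with $L_\infty(N)=L_\ast + B N^{-\beta}$ and $A(N)=A_0 N^{-\gamma}$. The helper names and every numeric value below are hypothetical, chosen only for illustration. Solving the law for $k$ gives $k \ge A/(L_{\text{target}} - L_\infty) - b$, and the marginal gain of one more expert is $A/(k+b) - A/(k+1+b)$.

```python
import math

def joint_loss(n, k, l_star, big_b, beta, a0, gamma, b):
    """Predicted loss for model size n and k merged experts."""
    l_inf = l_star + big_b * n ** (-beta)   # size-dependent floor
    a = a0 * n ** (-gamma)                  # size-dependent tail amplitude
    return l_inf + a / (k + b)

def experts_needed(target_loss, l_inf, a, b):
    """Smallest integer k >= 1 with l_inf + a / (k + b) <= target_loss.

    Returns None when the target is at or below the floor, which no
    number of experts can reach under the law.
    """
    gap = target_loss - l_inf
    if gap <= 0:
        return None
    return max(math.ceil(a / gap - b), 1)

def marginal_gain(k, a, b):
    """Predicted loss drop from going from k to k + 1 experts."""
    return a / (k + b) - a / (k + 1 + b)

def stop_point(a, b, min_gain, k_max=1000):
    """First k at which adding one more expert gains less than min_gain."""
    for k in range(1, k_max + 1):
        if marginal_gain(k, a, b) < min_gain:
            return k
    return None

# Made-up parameters at one fixed model size (not values from the paper).
print(experts_needed(target_loss=1.56, l_inf=1.50, a=0.40, b=0.50))
print(stop_point(a=0.40, b=0.50, min_gain=0.01))

# Budget trade-off: a larger base with few experts vs. a smaller base with many.
params = dict(l_star=1.0, big_b=2.0, beta=0.3, a0=1.0, gamma=0.2, b=0.5)
print(joint_loss(n=8.0, k=2, **params), joint_loss(n=2.0, k=8, **params))
```

Comparing `joint_loss` at candidate `(n, k)` pairs that cost the same is one simple way to frame the base-model-versus-experts trade-off; the cost model itself is left out because the source does not specify one.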

Paper Structure

This paper contains 65 sections, 3 theorems, 34 equations, 18 figures, 17 tables, 1 algorithm.

Key Result

Theorem 1

Under the assumptions above (equal weights), for each fixed $N$ the population-averaged loss over $k$ merged experts satisfies the second-order law where $H$ denotes an approximation to the Hessian matrix, and $\mu$ and $\Sigma$ are, respectively, the mean and covariance of the task vectors in the merged subspace. In particular, the empirical family $L_\infty(N) + A(N)/(k+b)$ appears with $b(N)=0$ at leadi

Figures (18)

  • Figure 1: Model Merging Scaling Law. CE vs. number of merged experts ($k$) at multiple model sizes ($N$) for four merging methods. Dots are real measurements; dotted lines are fits to the unified law $L_{\infty}(N)+A(N)/(k+b)$. Across methods we see the same pattern: steep early gains that flatten into a $1/(k{+}b)$ tail, and a uniform downward shift with larger $N$ (both the floor and the tail shrink). Differences between methods shrink as model size grows. $R^2>0.98$ over all fitted points.
  • Figure 2: Overview of Merging vs MultiTask. The polar axis represents the normalized negative loss.
  • Figure 3: Merging Scaling Law in a single algebra domain. (Left) CE vs. number of merged experts ($k$). (Middle) CE vs. model size ($N$). Dots are real measurements; lines are fits to the unified law $L_{\infty}(N)+A(N)/(k+b)$ on the single domain. (Right) Variance of CE decreases as CE decreases.
  • Figure 4: Larger models are easier to merge. (Left) Per-domain floors $L_\infty(N)$ fall monotonically with model size $N$. (Middle) Tail amplitude $A(N)$ is small and overall flat-to-decreasing with $N$. Most of the gain comes from the first few experts. (Right) Median fractional return $R(k)$ with IQR band; $k{=}5$ and $k{=}6$ cross the $85\%$/$90\%$ thresholds, respectively. This means that merging only 60% of the experts in the pool already achieves over 90% of the total return.
  • Figure 5: Method sensitivity is small at scale. Left: Mean CE vs. $k$ at $N{=}32$B: all methods follow the power law; the early-$k$ lead of TA/TIES($0.5$) is small ($\sim$1–2%) and narrows by $k{\gtrsim}8$. Right: Variance vs. $k$ at $N{=}32$B shows near-$1/k$ contraction; TIES/TA $<$ Average at small $k$, and all methods meet near the variance floor by $k{\approx}8$. Curves show measurements (markers) and floor+tail fits (lines) with a shared small $b$ per method.
  • ...and 13 more figures
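The fractional return $R(k)$ in Figure 4 is not defined in this summary. The sketch below assumes one plausible definition, $R(k) = (L(1) - L(k)) / (L(1) - L(K))$ for a pool of $K$ experts, and evaluates it under the floor+tail law, where $L_\infty$ and $A$ cancel. The pool size of 10 and $b = 0$ are assumptions for illustration, not values reported here; the paper's $R(k)$ is a measured median and may be defined differently.

```python
def fractional_return(k, pool_size, b=0.0):
    """Share of the total 1 -> pool_size gain already realised at k experts.

    Assumed definition: R(k) = (L(1) - L(k)) / (L(1) - L(pool_size)).
    Under L_inf + A / (k + b), both L_inf and A cancel out of the ratio.
    """
    if pool_size < 2 or not 1 <= k <= pool_size:
        raise ValueError("need pool_size >= 2 and 1 <= k <= pool_size")
    first = 1.0 / (1 + b)
    return (first - 1.0 / (k + b)) / (first - 1.0 / (pool_size + b))

def first_k_reaching(threshold, pool_size, b=0.0):
    """Smallest k whose fractional return is at least threshold."""
    for k in range(1, pool_size + 1):
        if fractional_return(k, pool_size, b) >= threshold:
            return k
    return None

# Hypothetical pool of 10 experts with b = 0 (assumed, not from the paper).
k85 = first_k_reaching(0.85, pool_size=10)
k90 = first_k_reaching(0.90, pool_size=10)
print(k85, k90)  # -> 5 6
```

Under these assumed values the thresholds fall at $k=5$ and $k=6$, which is in line with the caption, but that agreement should be read as a consistency check on the $1/(k+b)$ shape rather than a reproduction of the measured curve.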

Theorems & Definitions (4)

  • Theorem 1: Average-case joint merging law
  • Corollary 1: Variance shrinkage
  • Lemma 1: Moments of the mean-corrected step
  • proof
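The $1/k$ tail in Theorem 1 and the variance shrinkage in Corollary 1 can be motivated by a short calculation. This is a simplified sketch, not the paper's proof: it assumes the $k$ task vectors $\tau_1,\dots,\tau_k$ are i.i.d. with mean $\mu$ and covariance $\Sigma$, that they are merged with equal weights into $\bar\tau_k = \frac{1}{k}\sum_{i=1}^{k}\tau_i$, and that the loss is expanded to second order around the base parameters $\theta_0$ with gradient $g$ and symmetric Hessian approximation $H$. The symbols $\theta_0$, $g$, and $\bar\tau_k$ are introduced here for illustration.

```latex
\begin{aligned}
L(\theta_0 + \bar\tau_k)
  &\approx L(\theta_0) + g^\top \bar\tau_k
     + \tfrac{1}{2}\,\bar\tau_k^\top H\,\bar\tau_k, \\
\mathbb{E}[\bar\tau_k] &= \mu, \qquad
\operatorname{Cov}[\bar\tau_k] = \tfrac{1}{k}\,\Sigma, \\
\mathbb{E}\!\left[\bar\tau_k^\top H\,\bar\tau_k\right]
  &= \mu^\top H \mu + \tfrac{1}{k}\operatorname{tr}(H\Sigma), \\
\mathbb{E}[L \mid k]
  &\approx \underbrace{L(\theta_0) + g^\top \mu
     + \tfrac{1}{2}\,\mu^\top H \mu}_{\text{floor}}
   \;+\; \underbrace{\frac{\operatorname{tr}(H\Sigma)}{2k}}_{\text{tail, } b = 0}, \\
\operatorname{Var}[L \mid k]
  &\approx \frac{(g + H\mu)^\top \Sigma\,(g + H\mu)}{k}
   + O\!\left(k^{-2}\right).
\end{aligned}
```

Under these assumptions the floor depends on the base model and the mean task vector, while the tail amplitude $\operatorname{tr}(H\Sigma)/2$ grows with the diversity of the experts, matching the qualitative roles described in the abstract. Sampling experts without replacement from a finite pool, or using non-uniform weights, would change the constants and is one possible source of the small offset $b$ in the empirical fits.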