Sparsity-Aware Evolution for Model Merging

Huan Zhang, Yanjian Zhang, Guillaume Wisniewski, Nadi Tomeh, Bang Liu

TL;DR

The paper addresses destructive interference in multi-parent model merging by introducing sparsity-aware evolution (SAE), which jointly optimizes task performance and structured sparsity within an archive-based, layer-wise merging framework. It defines a merged space $\Theta_{\mathcal{M}}$ and a layer-wise mixing rule with $\lambda_r^{(l)} = \frac{s_A + \omega_A^{(l)}}{(s_A + \omega_A^{(l)}) + (s_B + \omega_B^{(l)})}$, and augments the fitness with sparsity signals to drive a dense-sparse-dense search that promotes modularity. Key contributions include sparsity-induced attraction, annealing sparsification via cyclic schedules, and empirical validation on large-scale LLM benchmarks where SAE outperforms strong baselines like PSO and yields smoother loss landscapes. The approach offers a scalable, data-free path to fuse diverse competencies while constraining interference, with practical impact for robust, multi-task LLM fusion and potential applicability beyond homologous architectures.
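To make the mixing rule concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that $s_A$ and $s_B$ are scalar fitness scores of the two parents, that $\omega_A^{(l)}$ and $\omega_B^{(l)}$ are per-layer sparsity signals, and that $\lambda_r^{(l)}$ weights parent A in a linear interpolation of layer $l$. None of these interpretations is spelled out in the summary above, and the function names are hypothetical.

```python
def layer_mixing_ratio(s_a, s_b, omega_a, omega_b):
    """Mixing ratio for one layer: (s_A + w_A) / ((s_A + w_A) + (s_B + w_B))."""
    a = s_a + omega_a
    b = s_b + omega_b
    total = a + b
    if total == 0:
        # Assumption: fall back to an equal mix when both terms vanish.
        return 0.5
    return a / total


def merge_layer(w_a, w_b, lam):
    """Assumed merge: linear interpolation, with lam weighting parent A."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(w_a, w_b)]


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    lam = layer_mixing_ratio(s_a=0.6, s_b=0.4, omega_a=0.2, omega_b=0.0)
    merged = merge_layer([1.0, 0.0], [0.0, 2.0], lam)
    print(lam, merged)
```

Under these assumptions, a parent with a higher score or a larger sparsity signal in a given layer receives a larger share of that layer, and the two shares always sum to one.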

Abstract

We propose a sparsity-aware evolutionary (SAE) framework for model merging that involves iterative pruning-merging cycles to act as a novel mutation operator. We incorporate the sparsity constraints into the score function, which steers the evolutionary process to favor more sparse models, in addition to other conventional performance scores. Interestingly, the by-product of competition for sparsity introduces an extra local attraction and interplay into the evolutionary process: if one competitor has more zero elements, the other competitor's non-zero elements will occupy those positions, even though the less sparse competitor loses to the more sparse competitor in other positions. The proposed pipeline is evaluated on a variety of large-scale LLM benchmarks. Experiments demonstrate that our approach can improve model merging reliability across multiple benchmarks, and is easy to incorporate due to its simplicity and being orthogonal to most existing approaches.
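The attraction effect described in the abstract can be illustrated with a toy example. This sketch assumes unstructured magnitude pruning and an element-wise linear merge, neither of which is confirmed by the abstract; the helper names are hypothetical. It only shows how positions zeroed in one parent end up determined by the other parent's non-zero values.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of entries with the smallest magnitude.

    Assumption: unstructured magnitude pruning stands in for whatever
    pruning criterion the paper actually uses.
    """
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def ceded_positions(w_sparse, w_other):
    """Indices where the pruned parent is zero and the other parent is not."""
    return [i for i, (x, y) in enumerate(zip(w_sparse, w_other))
            if x == 0.0 and y != 0.0]


def merge(w_a, w_b, lam):
    """Assumed element-wise linear merge, with lam weighting parent A."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(w_a, w_b)]


if __name__ == "__main__":
    # Illustrative weights only; they are not taken from the paper.
    parent_a = magnitude_prune([0.9, -0.05, 0.4, 0.01], sparsity=0.5)
    parent_b = [0.1, 0.8, -0.02, 0.7]
    print(parent_a)                             # parent A after pruning
    print(ceded_positions(parent_a, parent_b))  # positions B takes over
    print(merge(parent_a, parent_b, lam=0.5))
```

In this toy setting the merged values at the ceded positions depend only on parent B, which is one simple reading of "the other competitor's non-zero elements will occupy those positions".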

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: $\theta_A$ and $\theta_B$ are pretrained LLMs that are to be merged into $\theta_{\mathcal{M}}$. Different sizes of circles represent the mixing ratios belonging to different parents. We maintain a large archive of models after generation $t=0$ to promote diversity based on local and global competition mechanisms. Note that for the generation $t=1$, the upper-right neuron does not exist, since the parents' corresponding neurons have been pruned in the generation $t=0$.
  • Figure 2: Evolutionary forces in sparsity-aware model merging. Evaluation and sparsity jointly act as a natural selection mechanism over offspring models, while pruning introduces directed exploration toward increasingly empty parameter regions. The merged model evolves within the space spanned by the dense model, the sparse model, and the null space.
  • Figure 3: Convexity landscapes on MMLU-ProX. Each cell corresponds to a parameter point $\theta(\alpha,\beta)=\theta_0+\alpha d_1+\beta d_2$ along two random directions (layer-wise normalized), colored by a local convexity score computed from Hessian spectra: $\text{convexity} = |\lambda_{\min}| / (|\lambda_{\max}| + \epsilon)$, clipped to $[0, 0.5]$. Brighter regions indicate more balanced positive/negative curvature (i.e., relatively stronger non-convexity), while darker regions indicate one-sided curvature dominance.
  • Figure 4: Convexity landscapes on GSM8K. Each cell corresponds to a parameter point $\theta(\alpha,\beta)=\theta_0+\alpha d_1+\beta d_2$ along two shared random directions (layer-wise normalized). Cells are colored by a Hessian-based convexity proxy computed from the extreme eigenvalues: $\text{convexity} = |\lambda_{\min}| / (|\lambda_{\max}| + \epsilon)$, clipped to $[0, 0.5]$. Brighter regions indicate more balanced positive/negative curvature, while darker regions indicate one-sided curvature dominance.
  • Figure 5: Loss landscapes along shared random directions. Each row corresponds to a single task, and each column compares the expert model, the SAE-merged model, and the PSO-merged model under the same random directions $(\alpha,\beta)$ in parameter space.
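
The landscape construction in the captions of Figures 3 and 4 can be sketched as follows. This is an illustrative sketch, not the authors' code. It takes the extreme Hessian eigenvalues as given inputs, because the captions do not say how they are estimated. The default value of `eps` and the function names are assumptions.

```python
def grid_point(theta0, d1, d2, alpha, beta):
    """Parameter point theta(alpha, beta) = theta0 + alpha*d1 + beta*d2."""
    return [t + alpha * a + beta * b for t, a, b in zip(theta0, d1, d2)]


def convexity_score(lambda_min, lambda_max, eps=1e-8, clip=0.5):
    """Convexity proxy from the captions: |l_min| / (|l_max| + eps), clipped.

    lambda_min and lambda_max are the extreme Hessian eigenvalues at one
    grid point. The eps default is an assumption, not a value from the paper.
    """
    raw = abs(lambda_min) / (abs(lambda_max) + eps)
    return min(raw, clip)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    print(grid_point([1.0, 2.0], [1.0, 0.0], [0.0, 1.0], alpha=0.5, beta=-0.5))
    print(convexity_score(lambda_min=-1.0, lambda_max=4.0))
    print(convexity_score(lambda_min=-3.0, lambda_max=4.0))
```

A score near the clip value means the most negative and most positive curvatures have comparable magnitude, matching the "brighter" cells in the captions, while a score near zero means one side dominates.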