Beyond Parameter Arithmetic: Sparse Complementary Fusion for Distribution-Aware Model Merging

Weihong Lin, Lin Sun, Qilong Shi, Aomufei Yuan, Yuxuan Tian, Zhengyang Wang, Guangxiang Zhao, Xiangzheng Zhang, Tong Yang

TL;DR

The paper tackles instability in weight-space model merging, where dense updates amplify low-probability degeneration. It introduces Sparse Complementary Fusion with reverse KL (SCF-RKL), a distribution-aware, sparsity-driven framework that uses reverse KL divergence to identify high-information parameters and applies a binary mask to fuse only those coordinates. The authors provide theoretical guarantees on semantic stability, entropy preservation, and bounded subspace rotation, and demonstrate robust, cross-domain gains across 24 benchmarks (language and vision) without data-dependent fine-tuning. The work shows that distribution-aware sparse fusion can improve reasoning, safety, and cross-modal performance while maintaining the base-model capabilities, offering a practical, robust approach to stable model merging.
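The masked fusion step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-coordinate `scores` are supplied by hand here (in the paper they are derived from reverse KL divergence, which this sketch does not compute), and the names `top_k_mask`, `masked_fuse`, and `keep_ratio` are hypothetical. The update rule `base + mask * (secondary - base)` is an assumed form of "fuse only those coordinates".

```python
from typing import List


def top_k_mask(scores: List[float], keep_ratio: float) -> List[int]:
    """Return a binary mask that keeps the highest-scoring coordinates.

    `scores` stands in for a per-coordinate importance measure; how it is
    computed is an assumption left outside this sketch.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must lie in [0, 1]")
    k = int(keep_ratio * len(scores))
    # Rank indices by descending score; ties are broken by lower index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def masked_fuse(base: List[float], secondary: List[float], mask: List[int]) -> List[float]:
    """Copy the secondary model's value where mask == 1, keep the base elsewhere."""
    return [b + m * (s - b) for b, s, m in zip(base, secondary, mask)]


# Toy example with four "parameters" (all values are made up).
base = [1.0, 2.0, 3.0, 4.0]
secondary = [1.5, 2.0, 0.0, 5.0]
scores = [0.1, 0.0, 0.9, 0.4]  # hypothetical importance scores

mask = top_k_mask(scores, keep_ratio=0.5)
fused = masked_fuse(base, secondary, mask)
print(mask, fused)
```

With a mask of all zeros the fused model equals the base model, which is the sense in which a sparse mask keeps most of the base model untouched.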

Abstract

Model merging has emerged as a promising paradigm for composing the capabilities of large language models by directly operating in weight space, enabling the integration of specialized models without costly retraining. However, existing merging methods largely rely on parameter-space heuristics, which often introduce severe interference, leading to degraded generalization and unstable generation behaviors such as repetition and incoherent outputs. In this work, we propose Sparse Complementary Fusion with reverse KL (SCF-RKL), a novel model merging framework that explicitly controls functional interference through sparse, distribution-aware updates. Instead of assuming linear additivity in parameter space, SCF-RKL measures the functional divergence between models using reverse Kullback-Leibler divergence and selectively incorporates complementary parameters. This mode-seeking, sparsity-inducing design effectively preserves stable representations while integrating new capabilities. We evaluate SCF-RKL across a wide range of model scales and architectures, covering both reasoning-focused and instruction-tuned models. Extensive experiments on 24 benchmarks spanning advanced reasoning, general reasoning and knowledge, instruction following, safety, and vision classification demonstrate that SCF-RKL consistently outperforms existing model merging methods while maintaining strong generalization and generation stability.
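To make the divergence mentioned above concrete, here is a small sketch of the Kullback-Leibler divergence between two discrete distributions, showing that it is asymmetric in its arguments. Which argument order the paper calls "reverse" is not specified in this summary, so the code only exposes a generic `kl_divergence(a, b)`; the function name, the `eps` guard, and the toy distributions are assumptions for illustration.

```python
import math
from typing import Sequence


def kl_divergence(a: Sequence[float], b: Sequence[float], eps: float = 1e-12) -> float:
    """KL(a || b) = sum_i a_i * log(a_i / b_i) for discrete distributions.

    Terms with a_i == 0 contribute nothing; b_i is floored at `eps`
    (an assumed numerical guard) to avoid division by zero.
    """
    return sum(
        ai * math.log(ai / max(bi, eps))
        for ai, bi in zip(a, b)
        if ai > 0.0
    )


# Two toy next-token distributions over a two-symbol vocabulary.
q = [0.9, 0.1]  # peaked distribution
p = [0.5, 0.5]  # uniform distribution

kl_qp = kl_divergence(q, p)  # expectation taken under q
kl_pq = kl_divergence(p, q)  # expectation taken under p
print(f"KL(q||p) = {kl_qp:.4f}, KL(p||q) = {kl_pq:.4f}")
```

The two directions weight disagreement differently because the expectation is taken under a different distribution in each case; mode-seeking behavior is usually attributed to the direction whose expectation is taken under the distribution being adjusted.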

Paper Structure

This paper contains 29 sections, 5 theorems, 21 equations, 21 figures, 10 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $q$ and $p$ denote the output distributions of the base and secondary models, respectively. If the fusion mask $M$ is constructed based on the Reverse-KL importance, the KL divergence of the fused distribution $q_f$ satisfies:

Figures (21)

  • Figure 1: Repetition rates across fusion methods on Mistral-7B, Qwen2.5-14B, and Qwen2.5-32B. Baseline merging methods severely amplify repetition even when base models exhibit near-zero rates, whereas SCF-RKL consistently maintains low repetition ($<$1.6%) across all scales.
  • Figure 2: Unstable generation in existing fusion methods: (Left) catastrophic repetition and (Right) incoherent outputs, whereas SCF-RKL preserves both correctness and coherence.
  • Figure 3: Principal angle rotation analysis at layer 15 across different fusion methods. The gray dashed line represents the baseline gap between parent models M0 and M1.
  • Figure 4: Normalized Spectral Shift (NSS) for all fusion methods. Lower NSS values indicate better preservation of the parent models' spectral properties. Note: The y-axis scale for SCF-RKL has been adjusted for better visualization.
  • Figure 5: Singular value spectrum comparison. The inset shows a zoomed view around rank 100, revealing the subtle differences between fusion methods despite apparent overlap in the main plot.
  • ...and 16 more figures

Theorems & Definitions (9)

  • Theorem 3.1: Semantic Stability
  • Theorem 3.2: Entropy Preservation
  • Theorem 3.3: Bound on Subspace Rotation
  • proof
  • proof
  • proof
  • Corollary 1.1: Minimization of Normalized Spectral Shift
  • Proposition 1.2: Base Model Dominance
  • proof