Mediator: Memory-efficient LLM Merging with Less Parameter Conflicts and Uncertainty Based Routing

Kunfeng Lai, Zhenheng Tang, Xinglin Pan, Peijie Dong, Xiang Liu, Haolan Chen, Li Shen, Bo Li, Xiaowen Chu

TL;DR

Mediator merges finetuned LLMs while limiting parameter conflicts: layers with low conflict are averaged, and layers with high conflict are routed to task-specific experts. To cut storage, the experts are decomposed into one dense expert plus several sparse task-arithmetic experts. Routing is task-level and uncertainty-based, so experts are selected and combined per input, which keeps performance robust on both in-distribution and out-of-distribution (OOD) data. Experiments on LLaMA and Qwen show consistent gains over static and dynamic merging baselines at substantially lower system cost, including running a 7B×4 ensemble on a single RTX 4090.

Abstract

Model merging aggregates Large Language Models (LLMs) finetuned on different tasks into a stronger one. However, parameter conflicts between models lead to performance degradation in averaging. While model routing addresses this issue by selecting individual models during inference, it imposes excessive storage and compute costs and fails to leverage the common knowledge from different models. In this work, we observe that different layers exhibit varying levels of parameter conflicts. Building on this insight, we average layers with minimal parameter conflicts and use a novel task-level expert routing for layers with significant conflicts. To further reduce storage costs, inspired by task arithmetic sparsity, we decouple multiple fine-tuned experts into a dense expert and several sparse experts. To handle out-of-distribution samples, we select and merge appropriate experts based on the task uncertainty of the input data. We conduct extensive experiments on both LLaMA and Qwen with varying parameter scales, and evaluate on real-world reasoning tasks. Results demonstrate that our method consistently achieves significant performance improvements while requiring lower system cost than existing methods.
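The pipeline described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the conflict metric (fraction of coordinates whose task vectors disagree in sign), the threshold, the top-k sparsification, the choice of the averaged model as the dense expert, and the temperature softmax used for routing are all assumptions made for illustration, and every function name is hypothetical. Weights are flat Python lists so the example needs only the standard library.

```python
import math

def task_vector(finetuned, base):
    """Task arithmetic for one layer: finetuned weights minus pretrained weights."""
    return [f - b for f, b in zip(finetuned, base)]

def sign_conflict(task_vectors):
    """Fraction of coordinates where the task vectors disagree in sign.
    Assumed metric, not necessarily the one used in the paper."""
    n = len(task_vectors[0])
    conflicts = 0
    for i in range(n):
        signs = {math.copysign(1.0, tv[i]) for tv in task_vectors if tv[i] != 0.0}
        if len(signs) > 1:
            conflicts += 1
    return conflicts / n

def sparsify(vec, keep):
    """Keep the `keep` largest-magnitude entries and zero out the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    kept = set(order[:max(keep, 0)])
    return [v if i in kept else 0.0 for i, v in enumerate(vec)]

def merge_layer(base, experts, threshold=0.3, keep=2):
    """Average a low-conflict layer; for a high-conflict layer, store the
    averaged (dense) weights plus one sparse residual per expert."""
    tvs = [task_vector(e, base) for e in experts]
    mean_tv = [sum(col) / len(tvs) for col in zip(*tvs)]
    dense = [b + m for b, m in zip(base, mean_tv)]
    if sign_conflict(tvs) <= threshold:
        return {"mode": "average", "dense": dense, "sparse": []}
    residuals = [sparsify([t - m for t, m in zip(tv, mean_tv)], keep) for tv in tvs]
    return {"mode": "route", "dense": dense, "sparse": residuals}

def routing_weights(task_scores, temperature=1.0):
    """Softmax over per-task scores. A higher temperature flattens the
    weights, which blends more experts when the task is uncertain."""
    top = max(task_scores)
    exps = [math.exp((s - top) / temperature) for s in task_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy layer with four weights and two experts.
base = [0.0, 0.0, 0.0, 0.0]
math_expert = [1.0, 0.5, -0.2, 0.1]
agreeing_expert = [0.8, 0.4, -0.1, 0.2]     # same signs as math_expert
conflicting_expert = [-1.0, 0.5, 0.3, -0.1]  # opposite signs on three weights

low = merge_layer(base, [math_expert, agreeing_expert])
high = merge_layer(base, [math_expert, conflicting_expert])
print(low["mode"], high["mode"])
```

In this toy setup the first pair of experts agrees in sign everywhere, so the layer is simply averaged, while the second pair disagrees on three of four weights, so the layer keeps one dense copy and two sparse residuals. At inference time, a routed layer would be reconstructed as the dense weights plus a `routing_weights`-weighted sum of the sparse residuals.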

Paper Structure

This paper contains 63 sections, 2 theorems, 21 equations, 8 figures, 29 tables, 2 algorithms.

Key Result

Lemma 2.1

[xie2022an] Let $\mathcal{B}$ denote the set of $\tau$ that do not satisfy Condition cond:distinguish. We assume that $\text{KL}(p_{prompt}(y_\text{test}|x_\text{test}) \,\|\, p(y_\text{test}|x_\text{test},\tau))$ is bounded for all $\tau$ and that $\tau^\perp$ minimizes the multi-class logistic risk as [...]. If [...], then [...], where $g(\nu) = \frac{1}{2}((1-\nu)\log(1-\nu)+(1+\nu)\log(1+\nu))$ is the calibration fu…
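As a numerical sanity check on the function $g$ given in the lemma, the sketch below evaluates it and compares it with $\nu^2/2$, the leading term of its Taylor expansion around zero. The function name `calibration_g` is hypothetical, and restricting $\nu$ to $[0, 1)$ is an assumption made here to keep both logarithms finite; the lemma's own domain is not visible in this excerpt.

```python
import math

def calibration_g(nu: float) -> float:
    """g(nu) = 1/2 * ((1 - nu) * log(1 - nu) + (1 + nu) * log(1 + nu)).
    Restricted to 0 <= nu < 1 (an assumption) so log(1 - nu) is finite."""
    if not 0.0 <= nu < 1.0:
        raise ValueError("nu must lie in [0, 1)")
    return 0.5 * ((1 - nu) * math.log(1 - nu) + (1 + nu) * math.log(1 + nu))

# g vanishes at zero and grows at least quadratically on the sampled grid.
for nu in (0.0, 0.1, 0.5, 0.9):
    print(f"nu={nu:.1f}  g(nu)={calibration_g(nu):.6f}  nu^2/2={nu * nu / 2:.6f}")
```

Because $g(\nu) = \nu^2/2 + \nu^4/12 + \dots$ has only non-negative terms, $g$ is zero at $\nu = 0$ and bounded below by $\nu^2/2$ on this interval.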

Figures (8)

  • Figure 1: Knowledge conflict across finetuned LLMs on the math and code datasets. Deeper color indicates larger parameter conflicts. The linearly averaged model struggles to achieve low loss on both tasks.
  • Figure 2: Parameter conflict distribution across different layers of finetuned models (Qwen 2.5 7B).
  • Figure 3: The framework of Mediator.
  • Figure 4: Comparing magnitudes of task arithmetic and pretrained model parameters.
  • Figure 5: The inference timeline of Mediator, assuming that the number of layers is three.
  • ...and 3 more figures

Theorems & Definitions (3)

  • Definition 3.1: Task Arithmetic
  • Lemma 2.1
  • Theorem 2.2