Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking

Dengming Zhang, Xiaowen Ma, Zhenliang Ni, Zhenkai Wu, Han Shu, Xin Jiang, Xinghao Chen

TL;DR

This work tackles the problem of merging multiple domain-specific experts into a single, scalable model without full retraining. It introduces Expert Merging, which learns per-layer coefficients to explicitly align the merged model’s hidden states and logits with those of each expert using unlabeled calibration data, and Expert Merging++, which uses importance-guided chunking to allocate more capacity to high-importance layers. The approach combines hidden-state and logit alignment losses, coefficient regularization for stability, and task-weighted trade-offs. It achieves state-of-the-art or competitive performance across LLM and MLLM backbones, in some cases surpassing supervised Mixture Training. The results demonstrate robust cross-domain performance with label-free calibration and parameter-efficient merging, offering a practical way to deploy multi-domain capabilities at scale and motivating further research on inter-layer heterogeneity in model merging.

Abstract

Model merging, which combines multiple domain-specialized experts into a single model, offers a practical path to endow Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) with broad capabilities without the cost of joint training or serving many models. However, training-free methods rely on hand-tuned coefficients, whereas training-based methods primarily align parameters rather than downstream task behavior and typically treat all layers uniformly, ignoring inter-layer heterogeneity. We introduce Expert Merging, a training-light method that learns a small set of layer-wise coefficients using only unlabeled calibration data. The coefficients are optimized to explicitly align the merged model's hidden states and logits with those of the corresponding experts, with a coefficient regularizer for stability and task-weighted losses for controllable trade-offs. To capture inter-layer variation, Expert Merging++ augments this design with importance-guided chunking: a normalized layer-importance metric, derived from learned coefficients, task-vector magnitudes, and parameter counts, allocates more chunk-wise coefficients to high-importance layers while keeping low-importance layers lightweight. The result is a label-free, parameter-efficient, and scalable approach to multi-expert model merging across LLMs and MLLMs. Across MLLM backbones (InternVL and Qwen2-VL) and the LLM backbone (Mistral), our method surpasses strong training-free and training-based merging baselines, with Expert Merging++ delivering further gains and, in some cases, even exceeding supervised Mixture Training. The source code is available at https://github.com/Littleor/ExpertMerging.
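
The layer-wise merging and alignment objective described above can be illustrated with a minimal sketch. It assumes the common task-vector form, where each expert contributes its difference from the base model scaled by a per-layer coefficient. The function names (`merge_layerwise`, `alignment_loss`), the use of mean squared error for both alignment terms, the plain L2 penalty on the coefficients, and the weights `lam_hidden` and `lam_reg` are illustrative assumptions, not the paper's exact formulation.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # layer name -> flattened parameters


def merge_layerwise(base: Weights, experts: List[Weights],
                    alpha: List[Dict[str, float]]) -> Weights:
    """Assumed form: merged[l] = base[l] + sum_k alpha[k][l] * (expert_k[l] - base[l])."""
    merged: Weights = {}
    for layer, w_base in base.items():
        w_out = list(w_base)
        for k, expert in enumerate(experts):
            coeff = alpha[k][layer]  # one scalar per (expert, layer)
            for i, (w_exp, w_b) in enumerate(zip(expert[layer], w_base)):
                w_out[i] += coeff * (w_exp - w_b)  # scaled task vector
        merged[layer] = w_out
    return merged


def mse(u: List[float], v: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def alignment_loss(h_merged: List[float], h_expert: List[float],
                   z_merged: List[float], z_expert: List[float],
                   alpha_values: List[float],
                   lam_hidden: float = 1.0, lam_reg: float = 0.01) -> float:
    """Hypothetical objective: logit term + hidden-state term + L2 penalty.

    The distance functions and the regularizer are placeholders; the
    source only states that hidden states and logits are aligned and
    that the coefficients are regularized.
    """
    logit_term = mse(z_merged, z_expert)
    hidden_term = mse(h_merged, h_expert)
    reg_term = sum(a * a for a in alpha_values)
    return logit_term + lam_hidden * hidden_term + lam_reg * reg_term
```

In this sketch only the entries of `alpha` would be trained; the base and expert weights stay fixed, which is what keeps the number of learned parameters small.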

Paper Structure

This paper contains 31 sections, 4 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: Conceptual overview: (a) Task Vector, (b) Task Arithmetic (TA), (c) Expert Merging, and (d) Expert Merging++ with importance-guided chunk-wise coefficients.
  • Figure 2: Architecture of Expert Merging++. The base and expert models are frozen (snowflakes) and only the chunk-wise coefficients $\{\boldsymbol{\alpha}_k^{\ell}\}$ are trainable (flames). For unlabeled inputs from each domain, the merged model aligns both hidden states $\mathbf{h}_{\ell}$ (green) and logits $\mathbf{z}$ (yellow) to the corresponding expert, producing a single model that preserves all task performance.
  • Figure 3: Model-wise parameter size and learned coefficient across backbones after normalizing. Full stage-wise trends for panel (b) are provided in the Appendix (Figure 5).
  • Figure 4: Layer importance by stage and submodule across backbones.
  • Figure 5: Learned coefficient share across depth (early/middle/late) for each submodule. Late-stage blocks consistently receive larger shares for both attention and MLP; InternVL shows a stronger late-stage skew on the MLP down-projection, while Qwen’s shift is milder.
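
The importance-guided chunking of Expert Merging++ can be sketched roughly as follows. The abstract says only that the normalized layer-importance metric is derived from learned coefficients, task-vector magnitudes, and parameter counts; how they are combined is not given here. The product used in `layer_importance`, the allocation rule in `allocate_chunks`, and the `extra_budget` parameter are therefore hypothetical stand-ins chosen for illustration.

```python
import math
from typing import Dict, List, Tuple


def layer_importance(coeff: Dict[str, float], tv_norm: Dict[str, float],
                     n_params: Dict[str, int]) -> Dict[str, float]:
    """Hypothetical score: |coefficient| * task-vector norm * parameter count,
    normalized so that the scores sum to 1 across layers."""
    raw = {l: abs(coeff[l]) * tv_norm[l] * n_params[l] for l in coeff}
    total = sum(raw.values())
    return {l: v / total for l, v in raw.items()}


def allocate_chunks(importance: Dict[str, float],
                    extra_budget: int) -> Dict[str, int]:
    """Every layer keeps one coefficient; layers with higher importance
    receive a larger share of the extra chunk budget (assumed rule)."""
    return {l: 1 + math.floor(s * extra_budget) for l, s in importance.items()}


def chunk_bounds(n: int, chunks: int) -> List[Tuple[int, int]]:
    """Split n flattened parameters into nearly equal contiguous chunks."""
    size, rem = divmod(n, chunks)
    bounds, start = [], 0
    for i in range(chunks):
        end = start + size + (1 if i < rem else 0)
        bounds.append((start, end))
        start = end
    return bounds
```

Under this assumed rule, a low-importance layer keeps a single coefficient while a high-importance layer is split into several chunks, each with its own coefficient, which matches the "lightweight versus high-capacity" contrast the abstract describes.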