Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts

Qi Wang, Hanyang Peng, Yue Yu

TL;DR

Symphony-MoE tackles the challenge of constructing a scalable Mixture-of-Experts by upcycling experts from multiple identically-architected but independently trained dense models. It introduces a two-stage framework: a training-free harmonization stage that merges non-FFN layers into a shared backbone and aligns FFN neurons across source models via activation-based neuron permutation, followed by a post-training coordination stage with a learnable router and a load-balancing objective. Empirically, Symphony-MoE outperforms strong upcycling baselines on multi-domain in-distribution tasks and demonstrates robust out-of-distribution generalization, including medical-domain evaluation, across 0.5B and 1.5B scales and different backbone types. The work shows that combining diverse, heterogeneously trained experts in a coherent MoE can preserve domain expertise while enabling cross-domain transfer, offering a scalable path to leveraging existing specialized models.
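
The router and load-balancing objective mentioned above can be illustrated with a small sketch. This summary does not give the paper's exact formulation, so the code below assumes a softmax router with top-k selection and the Switch-Transformer-style auxiliary loss (number of experts times the sum over experts of dispatch fraction times mean router probability). The function names `route_top_k` and `load_balancing_loss` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def _top_k(probs, k):
    # Indices of the k largest router probabilities.
    return sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]

def route_top_k(router_logits, k):
    """For each token, return [(expert_index, renormalized_weight), ...]."""
    routes = []
    for logits in router_logits:
        probs = softmax(logits)
        top = _top_k(probs, k)
        norm = sum(probs[e] for e in top)
        routes.append([(e, probs[e] / norm) for e in top])
    return routes

def load_balancing_loss(router_logits, k):
    """Assumed auxiliary loss: N * sum_e f_e * p_e.

    f_e: fraction of routing slots dispatched to expert e
    p_e: mean router probability assigned to expert e
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    counts = [0] * num_experts
    mean_prob = [0.0] * num_experts
    for logits in router_logits:
        probs = softmax(logits)
        for e in _top_k(probs, k):
            counts[e] += 1
        for e in range(num_experts):
            mean_prob[e] += probs[e] / num_tokens
    frac = [c / (num_tokens * k) for c in counts]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))
```

Under this assumed loss, perfectly uniform routing gives a value of 1, and with top-1 routing the value approaches the number of experts as all tokens collapse onto a single expert, which is why minimizing it discourages expert collapse.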

Abstract

Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely, minimizing computational overhead. To mitigate the prohibitive cost of training MoEs from scratch, recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. However, this limits expert diversity, as all experts originate from a single pre-trained dense model. This paper addresses this limitation by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models (e.g., Qwen2.5-Coder and Qwen2). A key challenge lies in the fact that these source models occupy disparate, dissonant regions of the parameter space, making direct upcycling prone to severe performance degradation. To overcome this, we propose Symphony-MoE, a novel two-stage framework designed to harmonize these models into a single, coherent expert mixture. First, we establish this harmony in a training-free manner: we construct a shared backbone via a layer-aware fusion strategy and, crucially, alleviate parameter misalignment among experts using activation-based functional alignment. Subsequently, a stage of post-training coordinates the entire architecture. Experiments demonstrate that our method successfully integrates experts from heterogeneous sources, achieving an MoE model that significantly surpasses baselines in multi-domain tasks and out-of-distribution generalization.
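
To make the idea of activation-based functional alignment concrete, here is a toy sketch. It rests on several assumptions not confirmed by this summary: a plain two-layer ReLU FFN stands in for the gated FFN used by Qwen-style models, neuron similarity is the dot product of hidden activations over a small calibration set, and the matching is found by brute force. All function names are hypothetical.

```python
from itertools import permutations

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn(W_in, W_out, x):
    # Simplified FFN: W_out @ relu(W_in @ x)
    return matvec(W_out, relu(matvec(W_in, x)))

def hidden_acts(W_in, inputs):
    """Hidden activations for each calibration input."""
    return [relu(matvec(W_in, x)) for x in inputs]

def best_permutation(acts_anchor, acts_source):
    """perm[i] = source neuron matched to anchor neuron i.

    Maximizes the summed dot-product similarity between matched
    neurons. Brute force over all permutations: toy sizes only.
    """
    h = len(acts_anchor[0])
    sim = [[sum(a[i] * s[j] for a, s in zip(acts_anchor, acts_source))
            for j in range(h)] for i in range(h)]
    return max(permutations(range(h)),
               key=lambda p: sum(sim[i][p[i]] for i in range(h)))

def permute_ffn(W_in, W_out, perm):
    """Reorder hidden neurons: rows of W_in, matching columns of W_out."""
    new_in = [W_in[j] for j in perm]
    new_out = [[row[j] for j in perm] for row in W_out]
    return new_in, new_out
```

Because the same permutation is applied to the rows of `W_in` and the columns of `W_out`, the permuted FFN computes exactly the same function as before; only the ordering of hidden neurons changes, so that neuron i in the source expert plays a similar role to neuron i in the anchor. At realistic hidden sizes the brute-force search would be replaced by a linear assignment solver such as `scipy.optimize.linear_sum_assignment`.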

Paper Structure

This paper contains 25 sections, 8 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Comparison between the workflow of naive upcycling and ours.
  • Figure 2: Overview of the Symphony-MoE construction pipeline. In Stage 1, disparate source models are harmonized without training: non-FFN layers are merged into a shared backbone using techniques such as SLERP, while FFN layers are aligned through activation-based neuron permutation. In Stage 2, the experts, router, and shared backbone undergo post-training to enable coordination among the now-compatible components.
  • Figure 3: Quantitative analysis of inter-expert functional specialization using Centered Kernel Alignment (CKA). Lower CKA scores indicate greater functional specialization. High CKA scores between experts reflect parameter space misalignment, caused by failing to align neurons functionally during merging. This misalignment leads to representational collapse, erasing the distinct capabilities of individual experts.
  • Figure 4: Performance analysis of Symphony-MoE. (a) Average in-distribution (ID) and out-of-distribution (OOD) scores under different anchor model choices. (b) Impact of increasing the number of experts from 1 to 4 on ID and OOD performance.
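
Two techniques named in the captions for Figures 2 and 3, SLERP and CKA, can be sketched in a few lines. This is a minimal illustration under stated assumptions: `slerp` treats each weight tensor as a flat vector and falls back to linear interpolation when the vectors are nearly parallel or antiparallel, and `linear_cka` uses the linear-kernel form of CKA. Which CKA variant and which interpolation coefficients the paper uses are not stated in this summary.

```python
import math

def slerp(w0, w1, t):
    """Spherical linear interpolation between two flat weight vectors."""
    n0 = math.sqrt(sum(x * x for x in w0))
    n1 = math.sqrt(sum(x * x for x in w1))
    cos = sum(a * b for a, b in zip(w0, w1)) / (n0 * n1)
    omega = math.acos(max(-1.0, min(1.0, cos)))
    s = math.sin(omega)
    if abs(s) < 1e-8:  # degenerate angle: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    c0 = math.sin((1 - t) * omega) / s
    c1 = math.sin(t * omega) / s
    return [c0 * a + c1 * b for a, b in zip(w0, w1)]

def _center(X):
    # Subtract the per-feature mean over samples.
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def _cross_frob_sq(X, Y):
    """Squared Frobenius norm of X^T Y."""
    total = 0.0
    for xc in zip(*X):
        for yc in zip(*Y):
            d = sum(a * b for a, b in zip(xc, yc))
            total += d * d
    return total

def linear_cka(X, Y):
    """Linear CKA between activation matrices of shape [samples][features]."""
    Xc, Yc = _center(X), _center(Y)
    num = _cross_frob_sq(Xc, Yc)
    den = math.sqrt(_cross_frob_sq(Xc, Xc) * _cross_frob_sq(Yc, Yc))
    return num / den
```

Linear CKA is invariant to permuting or uniformly rescaling features, so two experts whose outputs differ only in neuron ordering score 1, while experts with uncorrelated representations score near 0. This matches the reading in the Figure 3 caption that lower scores between experts indicate greater functional specialization.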