Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts
Qi Wang, Hanyang Peng, Yue Yu
TL;DR
Symphony-MoE tackles the challenge of constructing a scalable Mixture-of-Experts (MoE) by upcycling experts from multiple identically-architected but independently trained dense models. It introduces a two-stage framework: a training-free functional alignment stage that builds a shared backbone and aligns FFN neurons across source models via neuron permutation, followed by a post-training coordination stage with a learnable router and a load-balancing objective. Empirically, Symphony-MoE outperforms strong upcycling baselines on multi-domain in-distribution tasks and generalizes robustly out of distribution, including on a medicine-domain evaluation, at the 0.5B and 1.5B scales and across different backbone types. The work shows that activating diverse, heterogeneously trained experts in a coherent MoE can preserve domain expertise while enabling cross-domain transfer, offering a scalable path to leveraging existing specialized models.
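The neuron-permutation idea mentioned above can be illustrated with a small sketch. It is not the paper's implementation: it assumes a toy two-matrix ReLU FFN rather than the actual model architecture, and it swaps in a simple greedy matching on activation correlations where an optimal assignment solver (for example `scipy.optimize.linear_sum_assignment`) could be used instead. All function and variable names are hypothetical.

```python
import numpy as np

def hidden_acts(x, w_in):
    # ReLU hidden activations of a toy FFN; x: (n, d), w_in: (h, d)
    return np.maximum(x @ w_in.T, 0.0)

def ffn(x, w_in, w_out):
    # Toy FFN output; w_out: (d, h)
    return hidden_acts(x, w_in) @ w_out.T

def match_neurons(acts_ref, acts_src):
    """Greedy one-to-one matching by activation correlation (an assumption,
    not the paper's procedure). perm[i] is the source neuron assigned to
    reference neuron i."""
    h = acts_ref.shape[1]
    a = acts_ref - acts_ref.mean(axis=0)
    b = acts_src - acts_src.mean(axis=0)
    a /= np.linalg.norm(a, axis=0) + 1e-8
    b /= np.linalg.norm(b, axis=0) + 1e-8
    sim = a.T @ b                      # (h, h) correlation matrix
    perm = np.full(h, -1)
    for _ in range(h):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        perm[i] = j
        sim[i, :] = -np.inf            # reference neuron i is taken
        sim[:, j] = -np.inf            # source neuron j is taken
    return perm

def permute_ffn(w_in, w_out, perm):
    # Reorder hidden neurons: rows of w_in and the matching columns of w_out.
    return w_in[perm], w_out[:, perm]

# Demo: the "source" FFN is the reference FFN with its neurons shuffled.
rng = np.random.default_rng(0)
d, h, n = 8, 16, 256
ref_w_in = rng.normal(size=(h, d))
ref_w_out = rng.normal(size=(d, h))
shuffle = rng.permutation(h)
src_w_in, src_w_out = ref_w_in[shuffle], ref_w_out[:, shuffle]

x_calib = rng.normal(size=(n, d))      # stand-in for calibration inputs
perm = match_neurons(hidden_acts(x_calib, ref_w_in),
                     hidden_acts(x_calib, src_w_in))
aligned_w_in, aligned_w_out = permute_ffn(src_w_in, src_w_out, perm)
```

Permuting the hidden neurons leaves the FFN's function unchanged, so alignment of this kind costs no accuracy on its own; it only relabels neurons so that corresponding units across source models sit at the same index.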
Abstract
Mixture-of-Experts (MoE) models scale performance by sparsely activating a large set of parameters, which keeps computational overhead low. To mitigate the prohibitive cost of training MoEs from scratch, recent work employs upcycling: reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. However, this limits expert diversity, because every expert starts from the same weights. This paper addresses this limitation by constructing powerful MoE models from experts sourced from multiple identically-architected but disparate pre-trained models (e.g., Qwen2.5-Coder and Qwen2). A key challenge is that these source models occupy disparate, dissonant regions of the parameter space, so direct upcycling is prone to severe performance degradation. To overcome this, we propose Symphony-MoE, a novel two-stage framework designed to harmonize these models into a single, coherent expert mixture. First, we establish this harmony in a training-free manner: we construct a shared backbone via a layer-aware fusion strategy and, crucially, alleviate parameter misalignment among experts using activation-based functional alignment. Second, a post-training stage coordinates the entire architecture. Experiments demonstrate that our method successfully integrates experts from heterogeneous sources, yielding an MoE model that significantly surpasses baselines on multi-domain tasks and in out-of-distribution generalization.
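For the post-training stage, the TL;DR mentions a learnable router and a load-balancing objective without giving their exact form. The sketch below shows one common choice, assumed here for illustration only: a softmax top-k router with a Switch-Transformer-style auxiliary loss. The function names and the `top_k` default are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route(x, w_router, top_k=2):
    """Pick top_k experts per token. x: (tokens, d), w_router: (d, experts)."""
    probs = softmax(x @ w_router)                        # (tokens, experts)
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]     # chosen expert ids
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    gates = top_p / top_p.sum(axis=-1, keepdims=True)    # renormalised weights
    return probs, top_idx, gates

def load_balance_loss(probs, top_idx):
    """Assumed auxiliary loss: n_experts * sum_i f_i * p_i, where f_i is the
    share of routing slots sent to expert i and p_i its mean router probability.
    It equals 1 for a perfectly uniform router and grows as routing collapses."""
    n_experts = probs.shape[1]
    counts = np.bincount(top_idx.ravel(), minlength=n_experts)
    f = counts / top_idx.size
    p = probs.mean(axis=0)
    return n_experts * float(np.sum(f * p))

# Demo: an untrained (all-zero) router gives uniform probabilities.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
probs0, idx0, gates0 = route(x, np.zeros((8, 4)), top_k=2)
uniform_loss = load_balance_loss(probs0, idx0)

# A fully collapsed router that sends every token to expert 0 (top-1).
probs_c = np.tile(np.eye(4)[0], (10, 1))
idx_c = np.zeros((10, 1), dtype=int)
collapsed_loss = load_balance_loss(probs_c, idx_c)
```

In training, a term like this would be added to the task loss with a small coefficient so that the router is discouraged from sending most tokens to a few experts.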
