Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts
Nikolas Gritsch, Qizhen Zhang, Acyr Locatelli, Sara Hooker, Ahmet Üstün
TL;DR
Nexus tackles the trade-off between efficiency, adaptability, and specialization in Mixture-of-Experts by introducing a domain-embedding projection router that maps domain representations to expert embeddings. This enables sparse upcycling of independently trained dense experts and allows efficient extension with new domains via a learned projection, avoiding full MoE retraining. Empirically, Nexus yields relative gains of up to 2.1% over the baseline during initial upcycling and up to 18.8% when extending with a new expert using limited finetuning, while preserving expert specialization (e.g., routing for a given domain concentrates on the corresponding expert). The approach facilitates an open, modular MoE ecosystem where users can assemble customized MoE mixtures and add new domains with minimal computational overhead. Overall, Nexus demonstrates robust performance across scales (470M and 2.8B seed models) and data domains, offering a practical path to adaptable, specialized, and scalable MoE systems.
Abstract
Efficiency, specialization, and adaptability to new data distributions are qualities that are hard to combine in current Large Language Models. The Mixture of Experts (MoE) architecture has been the focus of significant research because its inherent conditional computation enables such desirable properties. In this work, we focus on "upcycling" dense expert models into an MoE, aiming to improve specialization while also adding the ability to adapt to new tasks easily. We introduce Nexus, an enhanced MoE architecture with adaptive routing where the model learns to project expert embeddings from domain representations. This approach allows Nexus to flexibly add new experts after the initial upcycling through separately trained dense models, without requiring large-scale MoE training for unseen data domains. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the baseline for initial upcycling, and an 18.8% relative gain when extending the MoE with a new expert using limited finetuning data. This flexibility of Nexus is crucial to enable an open-source ecosystem where every user continuously assembles their own MoE-mix according to their needs.
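To make the routing idea in the abstract concrete, the following is a minimal sketch of a router whose expert embeddings are projected from domain representations. It is an illustration under stated assumptions, not the authors' implementation: the class name `ProjectionRouter`, the method `add_expert`, the single linear projection, the dot-product scoring, and the softmax top-k selection are all hypothetical choices, and the projection weights are randomly initialized here rather than learned.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ProjectionRouter:
    """Hypothetical router: expert embeddings are projected from domain embeddings."""

    def __init__(self, domain_embeddings, hidden_dim, seed=0):
        rng = random.Random(seed)
        domain_dim = len(domain_embeddings[0])
        # Assumption: one linear map stands in for the learned projection.
        # In a real system these weights would be trained, not sampled.
        self.projection = [
            [rng.gauss(0.0, 0.1) for _ in range(domain_dim)]
            for _ in range(hidden_dim)
        ]
        self.domain_embeddings = [list(d) for d in domain_embeddings]

    def expert_embeddings(self):
        """Project every domain embedding into the token hidden space."""
        return [matvec(self.projection, d) for d in self.domain_embeddings]

    def route(self, token_hidden, top_k=1):
        """Score experts by dot product (an assumption) and keep the top_k."""
        logits = [
            sum(h * e for h, e in zip(token_hidden, emb))
            for emb in self.expert_embeddings()
        ]
        probs = softmax(logits)
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        return ranked[:top_k], probs

    def add_expert(self, domain_embedding):
        """Register a new domain; the shared projection is reused unchanged."""
        self.domain_embeddings.append(list(domain_embedding))


# Two initial domains with 3-dimensional (made-up) domain embeddings.
router = ProjectionRouter([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], hidden_dim=4)
token = [0.5, -0.2, 0.1, 0.3]
chosen_before, probs_before = router.route(token, top_k=1)

# Extending the mixture only needs the new domain's embedding.
router.add_expert([0.0, 0.0, 1.0])
chosen_after, probs_after = router.route(token, top_k=1)
```

In this sketch, adding a domain changes only the list of domain embeddings; the projection is shared across experts, which is one plausible reading of why a new expert could be attached without retraining the router from scratch.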
