Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts

Nikolas Gritsch, Qizhen Zhang, Acyr Locatelli, Sara Hooker, Ahmet Üstün

TL;DR

Nexus tackles the trade-off between efficiency, adaptability, and specialization in Mixture-of-Experts by introducing a domain-embedding projection router that maps domain representations to expert embeddings. This enables sparse upcycling of independently trained dense experts and allows efficient extension with new domains via a learned projection, avoiding full MoE retraining. Empirically, Nexus yields up to 2.1% relative gains during initial upcycling and up to 18.8% when extending with a new expert using limited finetuning, while preserving expert specialization (e.g., domain routing concentrates on the corresponding expert). The approach facilitates an open, modular MoE ecosystem where users can assemble customized MoE mixtures with minimal computational overhead for adding new domains. Overall, Nexus demonstrates robust performance across scales (470M and 2.8B seed models) and data domains, offering a practical path to adaptable, specialized, and scalable MoE systems.

Abstract

Efficiency, specialization, and adaptability to new data distributions are qualities that are hard to combine in current Large Language Models. The Mixture of Experts (MoE) architecture has been the focus of significant research because its inherent conditional computation enables such desirable properties. In this work, we focus on "upcycling" dense expert models into an MoE, aiming to improve specialization while also adding the ability to adapt to new tasks easily. We introduce Nexus, an enhanced MoE architecture with adaptive routing where the model learns to project expert embeddings from domain representations. This approach allows Nexus to flexibly add new experts after the initial upcycling through separately trained dense models, without requiring large-scale MoE training for unseen data domains. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the baseline for initial upcycling, and an 18.8% relative gain for extending the MoE with a new expert by using limited finetuning data. This flexibility of Nexus is crucial to enable an open-source ecosystem where every user continuously assembles their own MoE-mix according to their needs.

Paper Structure (17 sections, 5 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Depiction of Nexus for a single Transformer block: A) In the initial training phase, each expert is trained separately. Furthermore, its training data is embedded by an embedding model and stored. The experts are combined by initializing each block's MoE layer with the seed model and each of the experts' FFN layers, and finetuning the model on a mix of all domains. During a forward pass, the seed model FFN is used as the shared expert and always activated. For the other experts, we perform top-1 routing based on the similarity of the transformed expert embeddings with the input data. B) Later, we can add a new expert by appending its training data embedding to the existing domain embeddings. The router function is independent of the number of experts, and therefore adapts quickly to the new one.
  • Figure 2: Router layer in Nexus: PyTorch-like pseudo-code illustrating a router layer, which consists of a 2-layer MLP network (domain_to_expert_ffn) to project domain embeddings to expert embeddings, shared and routed expert FFNs, and sparse Top-k gating. Note that the expert embeddings are independent of the input and could be precomputed once and stored during inference.
  • Figure 3: Downstream performance at different scales: Nexus consistently outperforms upcycled baselines at both the 470M and 2.8B parameter scales, showing the robustness of our method. We report the average performance on Knowledge, Science, Reasoning and MMLU.
  • Figure 4: Extending upcycled MoE models with the Code expert: After initial upcycling, we extended the MoEs (both Nexus and the MoE with a linear router) using an independently trained dense Code expert and finetuned the resulting models on a small number of tokens (200M, 500M, and 1B finetuning tokens), as described in the paper's section on extending the MoE. Nexus consistently outperforms the baseline in Code performance after extension without losing general performance. The general-task score is the macro average of the knowledge, science, reasoning, and general knowledge categories reported in the pretraining results section. Note that the dense Code expert achieves scores of 42.1 and 14.3 on general and code tasks, respectively.
  • Figure 5: Average routing probabilities for each expert per domain in Nexus: We compute the average routing probabilities across Transformer blocks for 512 samples per domain (from the 2.8B experiment). The labels on the x-axis represent the domain of the samples and the colored bars show the routing probabilities for the corresponding expert. We show token routing probabilities for the domains that are used to train specialized experts.
  • ...and 4 more figures
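
The routing scheme described in the captions of Figures 1, 2 and 5 can be illustrated with a small sketch. This is not the authors' code (the paper's Figure 2 gives PyTorch-like pseudo-code that is not reproduced here); it is a dependency-free Python approximation under several assumptions: the class and function names (`NexusRouterSketch`, `moe_forward`, `average_routing_probs`) are hypothetical, the MLP uses a ReLU activation and random weights, similarity is taken to be a dot product followed by a softmax, and routed expert outputs are weighted by their routing probability. Only the overall shape follows the captions: a 2-layer MLP maps stored domain embeddings to expert embeddings, the shared expert is always active, routed experts are chosen by top-k similarity, and a new expert is added by appending its domain embedding.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class NexusRouterSketch:
    """Hypothetical router: projects domain embeddings to expert embeddings."""

    def __init__(self, domain_embeddings, d_domain, d_hidden, d_model, seed=0):
        rng = random.Random(seed)
        self.domain_embeddings = [list(d) for d in domain_embeddings]
        # 2-layer MLP ("domain_to_expert_ffn" in the Figure 2 caption).
        # ReLU and random initialization are assumptions of this sketch.
        self.W1 = rand_matrix(d_hidden, d_domain, rng)
        self.W2 = rand_matrix(d_model, d_hidden, rng)

    def expert_embeddings(self):
        # Independent of the input token, so this could be precomputed once.
        out = []
        for d in self.domain_embeddings:
            hidden = [max(0.0, a) for a in matvec(self.W1, d)]
            out.append(matvec(self.W2, hidden))
        return out

    def route(self, x, k=1):
        """Return routing probabilities and the indices of the top-k experts."""
        scores = [sum(a * b for a, b in zip(e, x)) for e in self.expert_embeddings()]
        probs = softmax(scores)
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        return probs, top

    def add_expert(self, domain_embedding):
        # Extension step: the MLP weights are untouched, only the list grows.
        self.domain_embeddings.append(list(domain_embedding))


def moe_forward(x, router, shared_ffn, routed_ffns, k=1):
    """Shared expert is always applied; top-k routed experts are added on top."""
    probs, top = router.route(x, k)
    y = shared_ffn(x)
    for i in top:
        expert_out = routed_ffns[i](x)
        y = [a + probs[i] * b for a, b in zip(y, expert_out)]  # weighting is an assumption
    return y


def average_routing_probs(router, tokens):
    """Mean routing probability per expert over a batch of token vectors."""
    totals = [0.0] * len(router.domain_embeddings)
    for x in tokens:
        probs, _ = router.route(x)
        totals = [t + p for t, p in zip(totals, probs)]
    return [t / len(tokens) for t in totals]


# Toy usage: two experts, then a third one appended after "upcycling".
domains = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
router = NexusRouterSketch(domains, d_domain=3, d_hidden=8, d_model=4)
x = [0.5, -0.2, 0.1, 0.3]
probs_before, top_before = router.route(x)

router.add_expert([0.0, 0.0, 1.0])
probs_after, top_after = router.route(x)

zero_ffn = lambda v: [0.0] * len(v)
y = moe_forward(x, router, shared_ffn=lambda v: list(v), routed_ffns=[zero_ffn] * 3)
avg = average_routing_probs(router, [x, [0.1, 0.4, -0.3, 0.2]])
```

In this sketch the projection weights `W1` and `W2` have shapes that depend only on the embedding dimensions, never on the number of experts, which is the property the Figure 1 caption credits for fast adaptation when an expert is appended. In the paper the extended model is still finetuned on a limited number of tokens afterwards; that training step is outside the scope of this example.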