
Training-Free Dynamic Upcycling of Expert Language Models

Eros Fanì, Oğuzhan Ersoy

Abstract

Large Language Models (LLMs) have achieved remarkable performance on a wide range of specialized tasks, exhibiting strong problem-solving capabilities. However, training these models is prohibitively expensive, and they often lack domain-specific expertise because they are trained on general-knowledge datasets. Finetuning for domain expertise can address this issue; however, it often leads to overspecialization, and developing a single multi-domain expert remains difficult due to diverging objectives. Furthermore, multitask training is challenging due to interference and catastrophic forgetting. Existing work proposes combining the expertise of dense models within a Mixture of Experts (MoE) architecture, although this approach still requires multitask finetuning. To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model. Our method builds a single multitask model that preserves the capabilities of the original dense experts without requiring additional training. DUME is both cost-efficient and scalable: by leveraging the closed-form solution of ridge regression, it eliminates the need for further optimization and enables experts to be added dynamically while maintaining the model's original performance. We demonstrate that DUME consistently outperforms baseline approaches in both causal language modeling and reasoning settings, and we show that the DUME model can be fine-tuned to improve performance further. In the causal language modeling setting, DUME retains up to 97.6% of the performance of a dense expert model specialized in a particular domain, and in the reasoning setting it can even surpass the dense expert, reaching 102.1% of its performance. Our code is available at: github.com/gensyn-ai/dume.
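For context on the closed-form solution mentioned above: ridge (Tikhonov-regularized) regression over a feature matrix $X_l$ and targets $Y_l$ for a layer $l$ admits the standard solution

$$ W_l^* \;=\; \arg\min_{W} \; \lVert X_l W - Y_l \rVert_F^2 + \lambda \lVert W \rVert_F^2 \;=\; \left(X_l^\top X_l + \lambda I\right)^{-1} X_l^\top Y_l. $$

If the per-layer statistics $A_l = X_l^\top X_l$ and $b_l = X_l^\top Y_l$ appearing in the Figure 1 caption are interpreted this way (an assumption on our part; the symbols $X_l$, $Y_l$, and $\lambda$ are illustrative placeholders, not the paper's notation), the merged weights would reduce to $W_l^* = (A_l + \lambda I)^{-1} b_l$, which requires no gradient-based optimization.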

Figures (2)

  • Figure 1: Overview of the DUME method. For simplicity and illustration purposes, we show only a single transformer block $l$ among the total $L$ blocks. 1) Dense experts are trained on different domains. 2) MoErging phase and ridge regression statistics extraction: the parameters of all layers except the MLP layers, including the self-attention layers, are averaged. Each input token from all the domains is forwarded to the expert corresponding to its domain label. The feature maps extracted from the layers preceding the MLP blocks are used to update the $A_l$ and $b_l$ matrices. 3) The parameters $W_l^*$ are computed via the closed-form ridge regression solution using the matrices computed at the previous step. (A minimal code sketch of steps 2-3 follows this list.)
  • Figure 2: (a) Normalized average performance in both CLM and reasoning settings, with various numbers of tokens used to extract the ridge regression statistics for each paragraph. (b) Normalized average performance in both CLM and reasoning settings, as the $k$ of the top-$k$ routing selection increases. (c) Normalized average performance in both CLM and reasoning settings, with various values of the Tikhonov regularizer.
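
To make steps 2-3 of the Figure 1 caption concrete, the following is a minimal sketch of how the per-layer statistics and the closed-form solve could be implemented. Everything here is illustrative: the NumPy implementation, the function names (update_stats, solve_ridge), the array shapes, and the choice of each expert's MLP output as the regression target are assumptions for exposition, not the released DUME code (see github.com/gensyn-ai/dume for the actual implementation).

import numpy as np

def update_stats(A, b, X, Y):
    # Accumulate the ridge regression statistics for one layer:
    #   A_l += X^T X  (Gram matrix of pre-MLP feature maps)
    #   b_l += X^T Y  (correlation of those features with the targets)
    # X: (n_tokens, d_in) features from the layer preceding the MLP block,
    #    for tokens routed to the expert matching their domain label.
    # Y: (n_tokens, d_out) regression targets (assumed here: the domain
    #    expert's MLP output on those tokens).
    A += X.T @ X
    b += X.T @ Y
    return A, b

def solve_ridge(A, b, lam=1e-3):
    # Closed-form Tikhonov-regularized solution: W* = (A + lam * I)^{-1} b.
    return np.linalg.solve(A + lam * np.eye(A.shape[0]), b)

# Toy usage with two domains and a single (hypothetical) merged layer.
rng = np.random.default_rng(0)
d_in, d_out = 8, 8
A = np.zeros((d_in, d_in))
b = np.zeros((d_in, d_out))
for domain in range(2):
    X = rng.normal(size=(64, d_in))            # pre-MLP features for this domain's tokens
    W_expert = rng.normal(size=(d_in, d_out))  # stand-in for the domain expert's MLP
    A, b = update_stats(A, b, X, X @ W_expert)
W_star = solve_ridge(A, b)                     # merged MLP weights for this layer
print(W_star.shape)                            # (8, 8)

Because $A_l$ and $b_l$ are sums over tokens, statistics from a newly added expert can simply be accumulated into the existing matrices and the closed-form solve repeated, which is consistent with the dynamic, training-free addition of experts described in the abstract.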