DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models

Yongqi Huang, Peng Ye, Chenyu Huang, Jianjian Cao, Lin Zhang, Baopu Li, Gang Yu, Tao Chen

TL;DR

This work tackles parameter inefficiency in upcycled Mixture-of-Experts models by revealing that expert weights share a common base $W_{base}$ and that only small deltas $\Delta_i$ need adjustment. The DeRS paradigm decomposes each expert into $W_{base}$ plus a lightweight representation $\mathcal{F}(\Delta_i)$, enabling DeRS compression (sparsification/quantization) for inference and DeRS upcycling (sparse or low-rank increments) for training. Across general multimodal, medical multimodal, and code-generation tasks, DeRS achieves extreme parameter efficiency, reducing the extra parameters by up to thousands of times while preserving or improving performance. Extended DeRS further applies to the universal FFN, enabling substantial compression and efficiency gains during both training and deployment, with practical implications for scalable, resource-efficient MoE systems.

Abstract

Upcycled Mixture-of-Experts (MoE) models have shown great potential in various tasks by converting the original Feed-Forward Network (FFN) layers in pre-trained dense models into MoE layers. However, these models still suffer from significant parameter inefficiency due to the introduction of multiple experts. In this work, we propose a novel DeRS (Decompose, Replace, and Synthesis) paradigm to overcome this shortcoming, motivated by our observations about the unique redundancy mechanisms of upcycled MoE experts. Specifically, DeRS decomposes the experts into one expert-shared base weight and multiple expert-specific delta weights, and subsequently represents these delta weights in lightweight forms. The proposed DeRS paradigm can be applied to enhance parameter efficiency in two different scenarios: 1) DeRS Compression for the inference stage, using sparsification or quantization to compress vanilla upcycled MoE models; and 2) DeRS Upcycling for the training stage, employing lightweight sparse or low-rank matrices to efficiently upcycle dense models into MoE models. Extensive experiments across three different tasks show that the proposed methods can achieve extreme parameter efficiency while maintaining performance for both training and compression of upcycled MoE models.
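The decompose, replace, and synthesize steps described in the abstract can be illustrated with a minimal NumPy sketch of the compression scenario. It is not the authors' implementation: the function names are hypothetical, the base weight is assumed to be the original pre-trained FFN weight, and magnitude-based pruning is assumed as one possible sparsification rule.

```python
import numpy as np


def decompose(expert_weights, base_weight):
    """Split each expert into the shared base plus an expert-specific delta."""
    return [w - base_weight for w in expert_weights]


def sparsify_delta(delta, drop_rate):
    """Zero out the smallest-magnitude entries of a delta.

    Magnitude pruning is an assumption made for this sketch; other
    sparsification or quantization rules could be substituted here.
    """
    k = int(round(drop_rate * delta.size))  # number of entries to drop
    if k == 0:
        return delta.copy()
    # k-th smallest absolute value; everything at or below it is dropped
    threshold = np.partition(np.abs(delta).ravel(), k - 1)[k - 1]
    return delta * (np.abs(delta) > threshold)


def synthesize(base_weight, light_delta):
    """Rebuild an expert weight on demand from the base and its light delta."""
    return base_weight + light_delta


# Toy example: 4 experts that drifted only slightly from a shared FFN weight.
rng = np.random.default_rng(0)
base = rng.standard_normal((8, 16))
experts = [base + 0.01 * rng.standard_normal(base.shape) for _ in range(4)]

deltas = decompose(experts, base)
light_deltas = [sparsify_delta(d, drop_rate=0.9) for d in deltas]
rebuilt = [synthesize(base, d) for d in light_deltas]
```

Only `base` and the sparse `light_deltas` would need to be stored; a full expert weight is rebuilt only when that expert is needed at inference time.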

Paper Structure

This paper contains 24 sections, 6 equations, 8 figures, 19 tables.

Figures (8)

  • Figure 1: Visualization of cosine similarity in the first and last MoE layers of the MoE-LLaVA-Phi (Lin et al., 2024) model. FFN denotes the initial weight while $E_{i}$ denotes the trained weight of the $i$-th expert.
  • Figure 2: Overall procedure of (a) upcycling a dense model into a MoE model through vanilla upcycling and (b) compressing the vanilla upcycled MoE model using the proposed DeRS compression, which first decomposes the trained experts and then applies sparsification or quantization techniques to the expert-specific delta weights. During inference, when an expert is needed, we synthesize its weight online.
  • Figure 3: Comparison between vanilla upcycling and the proposed DeRS upcycling in the construction of experts. Instead of making $N$ copies of the original FFN, our DeRS upcycling synthesizes experts by combining the shared FFN with expert-specific lightweight parameters (i.e., sparse matrices or low-rank matrices).
  • Figure 4: Performance of applying different DeRS compression methods to compress three vanilla upcycled MoE-LLaVA models respectively.
  • Figure 5: Performance of applying different DeRS compression methods to compress two vanilla upcycled Med-MoE models respectively.
  • ...and 3 more figures
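
The expert construction described in the Figure 3 caption can be sketched for the low-rank case as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: the function names are hypothetical, and the zero initialization of one factor (so that every expert starts identical to the shared FFN) is borrowed from common low-rank adaptation practice, not confirmed by the source.

```python
import numpy as np


def init_low_rank_experts(w_ffn, num_experts, rank, seed=0):
    """Create one (B, A) low-rank factor pair per expert.

    B starts at zero so that B @ A is zero and each expert initially
    equals the shared FFN weight (an assumed initialization).
    """
    rng = np.random.default_rng(seed)
    d_out, d_in = w_ffn.shape
    factors = []
    for _ in range(num_experts):
        a = 0.01 * rng.standard_normal((rank, d_in))
        b = np.zeros((d_out, rank))
        factors.append((b, a))
    return factors


def synthesize_expert(w_ffn, factor):
    """Expert weight = shared FFN weight + low-rank expert-specific increment."""
    b, a = factor
    return w_ffn + b @ a


def extra_params(factors):
    """Count the parameters added on top of the shared FFN weight."""
    return sum(b.size + a.size for b, a in factors)


# Toy comparison: 4 experts over a 64 x 32 FFN weight with rank-2 increments.
w_ffn = np.random.default_rng(1).standard_normal((64, 32))
factors = init_low_rank_experts(w_ffn, num_experts=4, rank=2)

low_rank_cost = extra_params(factors)   # 4 * (64*2 + 2*32) parameters
vanilla_cost = 4 * w_ffn.size           # 4 full copies of the FFN weight
```

In this toy setting the low-rank increments add 768 parameters, compared with 8192 for four full copies of the FFN weight; the ratio grows with the weight dimensions and shrinks with the chosen rank.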