Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki

TL;DR

This work tackles the high resource cost of training large-scale MoE models by introducing Drop-Upcycling, a method that starts from a pre-trained dense model, replicates FFN layers as MoE, and then selectively re-initializes a subset of FFN parameters to foster expert diversification. By combining knowledge transfer with targeted diversification, Drop-Upcycling maintains fast initial gains while achieving long-term convergence comparable to training from scratch, all at a fraction of the training FLOPs. Across multiple scales (up to 8×3.7B) and hundreds of billions of tokens, the approach yields an MoE with 5.9B active parameters that matches or surpasses a 13B dense baseline, while significantly reducing compute. The paper provides extensive open resources to support reproducibility and further research in efficient MoE training.

Abstract

The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, training progresses more slowly than when training from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling, a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves comparable performance to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints and logs, are publicly available to promote reproducibility and future research on MoE.
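
To make the starting point concrete, here is a minimal pure-Python sketch of naïve upcycling, the baseline the abstract contrasts against. It is an illustration under stated assumptions, not the authors' code: the two-layer ReLU FFN, the softmax-over-top-k gate, the toy sizes, and all function names (`dense_ffn`, `upcycle`, `moe_ffn`) are hypothetical choices made for brevity. It shows one reason diversification matters: when every expert is an exact copy of the dense FFN and the gate weights sum to one, the MoE output equals the dense output regardless of which experts the router picks.

```python
import math
import random


def dense_ffn(x, w_in, w_out):
    """Toy two-layer FFN with ReLU (an assumed stand-in for the real FFN)."""
    hidden = [max(0.0, sum(x[i] * w_in[i][j] for i in range(len(x))))
              for j in range(len(w_in[0]))]
    return [sum(hidden[j] * w_out[j][k] for j in range(len(hidden)))
            for k in range(len(w_out[0]))]


def upcycle(w_in, w_out, num_experts):
    """Naive upcycling: every expert is an independent copy of the dense FFN."""
    return [([row[:] for row in w_in], [row[:] for row in w_out])
            for _ in range(num_experts)]


def moe_ffn(x, experts, router, top_k):
    """Route x to the top_k experts; gate weights are a softmax over their logits."""
    logits = [sum(x[i] * router[i][e] for i in range(len(x)))
              for e in range(len(experts))]
    chosen = sorted(range(len(experts)), key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in chosen)
    gates = [math.exp(logits[e] - peak) for e in chosen]
    total = sum(gates)
    out = [0.0] * len(experts[0][1][0])
    for e, gate in zip(chosen, gates):
        y = dense_ffn(x, *experts[e])
        out = [o + (gate / total) * v for o, v in zip(out, y)]
    return out


rng = random.Random(0)
d_model, d_ffn, num_experts = 4, 8, 8  # toy sizes, not the paper's
w_in = [[rng.gauss(0, 0.5) for _ in range(d_ffn)] for _ in range(d_model)]
w_out = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_ffn)]
# Assumption: the router is freshly initialized with small random weights.
router = [[rng.gauss(0, 0.02) for _ in range(num_experts)] for _ in range(d_model)]
x = [rng.gauss(0, 1) for _ in range(d_model)]

experts = upcycle(w_in, w_out, num_experts)
dense_out = dense_ffn(x, w_in, w_out)
moe_out = moe_ffn(x, experts, router, top_k=2)
max_gap = max(abs(a - b) for a, b in zip(dense_out, moe_out))
print(f"max |dense - moe| = {max_gap:.2e}")
```

Because the experts start out identical, the router initially has nothing to distinguish them by, which is consistent with the abstract's observation that naïve upcycling gains quickly at first but then progresses slowly.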

Paper Structure

This paper contains 37 sections, 8 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Overview of the Drop-Upcycling method. The key difference from the naïve Upcycling is Diversity re-initialization, introduced in Section \ref{sec:method}.
  • Figure 2: Initialization of expert weights. Columns (rows) are selected according to a randomly chosen set of intermediate-layer indices $\mathcal{S}$; all elements of the selected columns (rows) are then re-initialized by sampling from a normal distribution. The remaining columns (rows) are kept unchanged.
  • Figure 3: Comparison of learning curves for different MoE construction methods. The top and bottom rows illustrate the changes in training loss and downstream task scores during training, respectively. In both metrics, the proposed method, Drop-Upcycling with $r=0.5$, achieves the best performance, gaining initial knowledge transfer while avoiding convergence slowdown.
  • Figure 4: Impact of re-initialization ratio $r$. The training loss and downstream task score over the total number of tokens processed during training on 8×152M (left two figures) and 8×1.5B (right two figures) settings are illustrated. Even with different $r$ values, Drop-Upcycling robustly outperforms naïve Upcycling, and 0.5 appears to be the most effective ratio.
  • Figure 5: Comparison of expert routing patterns across different MoE construction methods. Drop-Upcycling exhibits more balanced expert utilization than naïve Upcycling. Results shown for layers 0 (first), 8, 16, and 23 (last); see Appendix \ref{subsec:detailed_routing_analysis} for results on all layers.
  • ...and 8 more figures
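
The partial re-initialization described in the Figure 2 caption can be sketched as follows. This is a hedged, pure-Python illustration rather than the paper's implementation: the function names (`drop_reinit`, `drop_upcycle`), the two-matrix FFN layout, and the toy sizes are hypothetical, and the choice to sample from a normal distribution matching each matrix's overall mean and standard deviation is an assumption of this sketch (the caption only says "the normal distribution"). The ratio `r = 0.5` follows the value the Figure 3 and Figure 4 captions report as most effective.

```python
import random
import statistics


def drop_reinit(w_in, w_out, ratio, rng):
    """Re-initialize a random subset S of intermediate units for one expert.

    w_in  is d_model x d_ffn: columns index intermediate units.
    w_out is d_ffn x d_model: rows index intermediate units.
    Returns (new_w_in, new_w_out, sorted list of selected indices S).
    """
    d_ffn = len(w_out)
    selected = sorted(rng.sample(range(d_ffn), int(ratio * d_ffn)))

    def stats(matrix):
        # Assumption: use the whole matrix's mean and std for the normal draw.
        flat = [v for row in matrix for v in row]
        return statistics.fmean(flat), statistics.pstdev(flat)

    mu_in, sd_in = stats(w_in)
    mu_out, sd_out = stats(w_out)
    new_in = [row[:] for row in w_in]    # copy so the dense weights stay intact
    new_out = [row[:] for row in w_out]
    for j in selected:
        for i in range(len(new_in)):     # column j of the input projection
            new_in[i][j] = rng.gauss(mu_in, sd_in)
        # row j of the output projection
        new_out[j] = [rng.gauss(mu_out, sd_out) for _ in new_out[j]]
    return new_in, new_out, selected


def drop_upcycle(w_in, w_out, num_experts, ratio, seed=0):
    """Build experts from one dense FFN, each with its own random index set S."""
    rng = random.Random(seed)
    return [drop_reinit(w_in, w_out, ratio, rng) for _ in range(num_experts)]


init_rng = random.Random(1)
d_model, d_ffn = 4, 8  # toy sizes, not the paper's
w_in = [[init_rng.gauss(0, 0.5) for _ in range(d_ffn)] for _ in range(d_model)]
w_out = [[init_rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_ffn)]

experts = drop_upcycle(w_in, w_out, num_experts=4, ratio=0.5)
for idx, (_, _, selected) in enumerate(experts):
    print(f"expert {idx}: re-initialized intermediate units {selected}")
```

Each expert draws its own index set, so the experts share the dense model's weights on the units they keep but differ on the units they re-initialize.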