LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, Yu Cheng

TL;DR

Mixture-of-Experts (MoE) models scale language modeling efficiently, but training them from scratch is costly and unstable. This work builds MoE models from an existing dense LLaMA-2 7B by partitioning its FFNs into multiple experts and continually pre-training the result together with newly added gate networks. It then systematically studies expert-construction methods and data-sampling and filtering strategies. LLaMA-MoE-3.5B outperforms dense baselines with a similar number of activated parameters, and static data sampling and fluency filtering further accelerate convergence and improve downstream performance. The study provides a practical, transparent pipeline for turning dense LLMs into sparse MoE models that activate only part of their parameters per token, and releases code and models.

Abstract

Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE models from scratch at large scale is still data-hungry and unstable. Motivated by this limitation, we investigate building MoE models from existing dense large language models. Specifically, starting from the well-known LLaMA-2 7B model, we obtain an MoE model in two stages: (1) Expert Construction, which partitions the parameters of the original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual Pre-training, which further trains the transformed MoE model and the additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models maintain their language abilities and route input tokens to specific experts, with only part of the parameters activated. Empirically, after training on 200B tokens, LLaMA-MoE-3.5B models significantly outperform dense models with a similar number of activated parameters. The source code and models are available at https://github.com/pjlab-sys4nlp/llama-moe .
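
The two stages described in the abstract can be illustrated with a small sketch. It is not the released implementation: it assumes the SwiGLU form of the LLaMA FFN, a random split of intermediate neurons into equal-sized, non-overlapping experts (one of several possible construction methods), and a plain softmax top-k router. All function names and the toy dimensions are hypothetical.

```python
import math
import random


def silu(z):
    """SiLU activation used inside the SwiGLU FFN."""
    return z / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def ffn_forward(x, neurons):
    """SwiGLU FFN written as a sum over intermediate neurons.

    Each neuron is a (gate_vec, up_vec, down_vec) triple of length d_model.
    """
    out = [0.0] * len(x)
    for gate_vec, up_vec, down_vec in neurons:
        act = silu(dot(x, gate_vec)) * dot(x, up_vec)
        for i, d in enumerate(down_vec):
            out[i] += act * d
    return out


def split_into_experts(neurons, num_experts, seed=0):
    """Assumed construction: shuffle neurons, then cut into equal disjoint groups."""
    if len(neurons) % num_experts != 0:
        raise ValueError("d_ff must be divisible by num_experts")
    order = list(range(len(neurons)))
    random.Random(seed).shuffle(order)
    size = len(neurons) // num_experts
    return [
        [neurons[j] for j in order[e * size:(e + 1) * size]]
        for e in range(num_experts)
    ]


def moe_forward(x, experts, router, top_k):
    """Route one token to its top_k experts (hypothetical softmax router).

    `router` holds one weight vector per expert; it stands in for the newly
    added gate network that continual pre-training would have to learn.
    """
    logits = [dot(x, w) for w in router]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [v / total for v in exps]
    chosen = sorted(range(len(experts)), key=lambda e: probs[e], reverse=True)[:top_k]
    out = [0.0] * len(x)
    for e in chosen:
        for i, v in enumerate(ffn_forward(x, experts[e])):
            out[i] += probs[e] * v
    return out, chosen
```

Because the FFN output is a sum over intermediate neurons, the unweighted sum of all expert outputs reproduces the dense FFN exactly. Activating only `top_k` experts and weighting them by router probabilities changes the output, which is one way to see why the transformed model needs further training.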

Paper Structure

This paper contains 21 sections, 8 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: The main framework for building LLaMA-MoE models. (a) The original FFNs in LLaMA are split into different experts. (b) In the transformed LLaMA-MoE, the hidden states are processed by a selected subset of experts instead of all experts. We continue to train LLaMA-MoE to improve its performance.
  • Figure 2: Model performance on the ARC-c and HellaSwag datasets and the training loss for LLaMA-MoE-3.0B and LLaMA-MoE-3.5B. Both models are trained on 200B tokens.
  • Figure 3: Model performance with different expert construction methods. Among the four construction methods, Independent$_\text{Random}$ obtains the best result. We also present an ablation study of expert output re-scaling after 5B tokens of continual pre-training.
  • Figure 4: Model performance with different data sampling strategies. Among the four sampling strategies, Static$_{\text{Sheared}}$ achieves the best performance, although it does not achieve the lowest training loss.
  • Figure 5: Variation of data sampling weights on four domains. For Static$_{\text{Sheared}}$ and Static$_{\text{LLaMA}}$, the sampling weights are fixed throughout training, while the domain importance gradually changes for Dynamic$_{\text{Uniform}}$ and Dynamic$_{\text{LLaMA}}$. Dynamic$_{\text{Uniform}}$ and Dynamic$_{\text{LLaMA}}$ are the two dynamic weight sampling strategies from Xia et al. (2023).
  • ...and 4 more figures
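
The static sampling described in the Figure 5 caption can be sketched as drawing each training example's domain from a fixed weight vector. This is a minimal illustration only: the function name, domain names, and weights are hypothetical and are not taken from the paper.

```python
import random


def make_static_sampler(domain_weights, seed=0):
    """Return a function that samples domain names with fixed weights.

    `domain_weights` maps a domain name to a non-negative weight. The weights
    never change between calls, which is what makes the strategy "static".
    """
    names = list(domain_weights)
    weights = [domain_weights[n] for n in names]
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    rng = random.Random(seed)

    def sample_batch(batch_size):
        # One domain label per example in the batch.
        return rng.choices(names, weights=weights, k=batch_size)

    return sample_batch


# Hypothetical domains and weights, for illustration only.
sampler = make_static_sampler({"web": 0.6, "code": 0.2, "books": 0.1, "papers": 0.1})
batch_domains = sampler(8)
```

A dynamic strategy would instead replace the weight vector between batches according to some update rule. That rule is not reproduced here.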