Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Rameswar Panda

TL;DR

This work tackles the parameter inefficiency of sparsely trained MoE models by propagating gradients densely through all experts during training while running sparse inference with top-$K$ routing. By replacing self-attention with Mixture of Attention heads and using a Mutual Information (MI) loss to balance usage across the $N$ experts, DS-MoE achieves parameter efficiency comparable to dense models while keeping MoE-style throughput advantages. Empirical results across 1B–6B scales pretrained on a large CodeGen-enabled Pile show DS-MoE matching or exceeding dense-model performance with only $30$-$40\%$ of its parameters active at inference and delivering substantial speedups (e.g., up to $1.86\times$ over Mistral-7B when served with vLLM). The findings demonstrate DS-MoE’s potential to deliver high throughput in both computation-bounded and I/O-bounded regimes, running $1.50\times$ to $1.71\times$ faster than comparable MoEs such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.
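
The summary names a Mutual Information loss for balancing expert usage but does not spell it out here. A minimal sketch, assuming the common form $L_{MI} = -H(e) + \frac{1}{|X|}\sum_{x} H(e \mid x)$ with tokens weighted uniformly; the function names `entropy` and `mi_balance_loss` are hypothetical and not taken from the paper's code:

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def mi_balance_loss(router_probs: Sequence[Sequence[float]]) -> float:
    """Negative mutual information between tokens and experts (assumed form).

    router_probs[t][e] is the router's probability p(e | x_t) for token t.
    Minimizing the result pushes the batch-level expert usage toward uniform
    (high H(e)) while keeping each token's routing peaked (low H(e | x)).
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # Marginal expert usage p(e), averaging uniformly over tokens.
    marginal = [
        sum(row[e] for row in router_probs) / num_tokens
        for e in range(num_experts)
    ]
    # Average per-token routing entropy H(e | x).
    mean_conditional = sum(entropy(row) for row in router_probs) / num_tokens
    return -entropy(marginal) + mean_conditional
```

Under this assumed form, four tokens routed one-hot to four different experts reach the minimum of $-\log 4$, while four tokens all routed to the same expert give a loss of $0$, so the loss penalizes collapsed routing.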

Abstract

Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally require 2-4$\times$ more parameters to achieve performance comparable to a dense model, which incurs larger GPU memory requirements and makes MoE models less efficient in I/O-bounded scenarios like autoregressive generation. In this work, we propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency by employing dense computation across all experts during training and sparse computation during inference. Our experiments on training LLMs demonstrate that our DS-MoE models are more parameter-efficient than standard sparse MoEs and are on par with dense models in terms of total parameter size and performance while being computationally cheaper (activating 30-40% of the model's parameters). Performance tests using vLLM show that our DS-MoE-6B model runs up to $1.86\times$ faster than similar dense models like Mistral-7B, and between $1.50\times$ and $1.71\times$ faster than comparable MoEs, such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.
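
The contrast between dense computation during training and sparse computation during inference can be illustrated with a toy MoE layer. This is a sketch under stated assumptions, not the authors' implementation: `softmax` and `moe_forward` are hypothetical names, experts are plain Python callables, and the top-$K$ gate weights are left unnormalized, which the abstract does not specify.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Vector,
    router_logits: Sequence[float],
    experts: Sequence[Expert],
    top_k: Optional[int] = None,
) -> Tuple[Vector, List[int]]:
    """Gate-weighted sum of expert outputs for a single token.

    top_k=None  -> dense mode: every expert runs (as in dense training).
    top_k=K     -> sparse mode: only the K highest-scoring experts run.
    Assumption: gate weights are not renormalized after top-K selection.
    """
    probs = softmax(router_logits)
    if top_k is None:
        active = list(range(len(experts)))
    else:
        ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
        active = ranked[:top_k]
    out = [0.0] * len(x)
    for i in active:
        y = experts[i](x)  # only active experts cost compute
        for d in range(len(out)):
            out[d] += probs[i] * y[d]
    return out, active
```

Calling `moe_forward(x, logits, experts)` evaluates all experts, whereas `moe_forward(x, logits, experts, top_k=2)` evaluates only two; the second return value lists which experts actually ran, which is the quantity behind the "active parameters" figures quoted above.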

Paper Structure

This paper contains 28 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Subfigure (a) showcases the sizes and computational profiles of the Dense-3B, SMoE-5B, and DS-MoE-3B models, each achieving a comparable averaged task performance in Table \ref{tab:base_performance}. The computational cost is quantified by counting the number of active parameters engaged during inference. Subfigure (b) displays the performance of our DS-MoE-6B model in sparse inference, set against that of the traditional dense models and SMoE models. The radius of the icon circle reflects the total number of the model parameters.
  • Figure 2: Illustration of Dense Training of MoE models: Subfigure (a) illustrates the conventional sparse training method in MoE models, characterized by sparse gradient propagation in both the router and the experts. In subfigure (b), we detail the dense training strategy in our DS-MoE, which involves dense propagation of gradients for both routers and experts.
  • Figure 3: We assess the sparsity in our DS-MoEs by gradually deactivating experts to attain increasingly sparse configurations, monitoring until a significant performance drop occurs.
  • Figure 4: Expert Sampling Strategy Evaluation. We assess the impact of different expert sampling strategies on the Wikitext perplexity (PPL) using our DS-MoE-3B model.
  • Figure 5: Layer Utilization Assessment. We determine the average proportion of activated experts within both the self-attention and MLP layers. This analysis is conducted using the Wikitext dataset with our DS-MoE-3B model.
  • ...and 1 more figure