Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Rameswar Panda
TL;DR
This work tackles the parameter inefficiency of sparse MoE training by propagating gradients densely through all experts during training while running sparse inference with top-$K$ routing. By replacing self-attention with Mixture of Attention heads and using a Mutual Information (MI) loss to balance usage across the $N$ experts, DS-MoE achieves parameter efficiency comparable to dense models while keeping MoE-style throughput advantages. Empirical results across 1B–6B scales pretrained on a large CodeGen-enabled Pile show DS-MoE matching or exceeding dense-model performance with only $30$-$40\%$ of its parameters active at inference, and delivering substantial speedups (e.g., up to $1.86\times$ on vLLM against Mistral-7B). The findings indicate that DS-MoE can deliver high throughput in both computation-bounded and I/O-bounded regimes, outperforming comparable MoE models in throughput while using fewer active parameters.
Abstract
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally require 2-4$\times$ more parameters to achieve performance comparable to a dense model, which incurs larger GPU memory requirements and makes MoE models less efficient in I/O-bounded scenarios like autoregressive generation. In this work, we propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE), which achieves strong computation and parameter efficiency by employing dense computation across all experts during training and sparse computation during inference. Our experiments on training LLMs demonstrate that our DS-MoE models are more parameter-efficient than standard sparse MoEs and are on par with dense models in terms of total parameter size and performance while being computationally cheaper (activating 30-40% of the model's parameters). Performance tests using vLLM show that our DS-MoE-6B model runs up to $1.86\times$ faster than similar dense models like Mistral-7B, and between $1.50\times$ and $1.71\times$ faster than comparable MoEs, such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.
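The dense-training, sparse-inference idea can be illustrated with a small sketch. This is not the authors' implementation: the toy experts, the router weights, and the names `make_expert`, `route`, `moe_forward`, and `mi_loss` are hypothetical, and the choice not to renormalize the top-$K$ router probabilities is an assumption. The MI term is written here as the negative entropy of the batch-averaged routing distribution plus the mean per-token routing entropy, which is one common way to express such a balancing loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical stand-in for an expert network: an element-wise tanh."""
    return lambda x: [math.tanh(scale * v) for v in x]


def route(x, router_weights):
    """Router probabilities: one dot product per expert, then a softmax."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    return softmax(logits)


def moe_forward(x, experts, router_weights, top_k=None):
    """top_k=None runs every expert (dense training mode);
    an integer top_k runs only the highest-scoring experts (sparse inference)."""
    probs = route(x, router_weights)
    indices = range(len(experts))
    if top_k is None:
        active = list(indices)
    else:
        active = sorted(indices, key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(x)
    for i in active:
        y = experts[i](x)
        # Assumption: outputs are weighted by the raw router probability,
        # without renormalizing over the selected experts.
        out = [o + probs[i] * v for o, v in zip(out, y)]
    return out, active


def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)


def mi_loss(batch_probs):
    """-H(marginal expert usage) + mean per-token routing entropy.
    Lower is better: balanced usage overall, confident routing per token."""
    n_tokens = len(batch_probs)
    n_experts = len(batch_probs[0])
    marginal = [sum(p[i] for p in batch_probs) / n_tokens for i in range(n_experts)]
    conditional = sum(entropy(p) for p in batch_probs) / n_tokens
    return -entropy(marginal) + conditional


if __name__ == "__main__":
    experts = [make_expert(s) for s in (0.5, 1.0, 1.5, 2.0)]
    router_weights = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1],
                      [-0.3, 0.2, 0.1], [0.0, 0.5, -0.4]]
    x = [1.0, -2.0, 0.5]
    dense_out, dense_active = moe_forward(x, experts, router_weights)
    sparse_out, sparse_active = moe_forward(x, experts, router_weights, top_k=2)
    print("dense experts used:", dense_active)
    print("sparse experts used:", sparse_active)
    print("MI loss on this token:", mi_loss([route(x, router_weights)]))
```

In the dense mode every expert contributes to the output, so every expert receives gradient in a real training setup; in the sparse mode only the selected experts are evaluated, which is where the inference savings come from.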
