GRIN: GRadient-INformed MoE
Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
TL;DR
This work tackles the training challenges of sparse MoE models by introducing GRIN MoE, which uses sparse gradient estimation (SparseMixer-v2) for expert routing and configures model parallelism to avoid token dropping. The approach enables top-2 routing over 16 experts without expert parallelism, achieving high training efficiency and strong performance on autoregressive language modeling: with only 6.6B activated parameters, the model outperforms a 7B dense model and matches a 14B dense model trained on the same data. Through extensive evaluations and semi-controlled analyses, GRIN MoE demonstrates robustness in math and coding tasks and reveals how global load balancing and routing specialization contribute to the gains. The study also outlines practical trade-offs and avenues for scaling MoE further, highlighting both the engineering and the algorithmic challenges ahead.
Abstract
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16$\times$3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
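To make the routing problem concrete, the following is a minimal sketch of generic top-2 routing over 16 experts for a single token, written in plain Python. It is not GRIN's router or the SparseMixer-v2 estimator. The function names (`softmax`, `top2_route`, `moe_forward`), the scalar toy experts, and the renormalization of the two gate weights are all assumptions made for illustration. The sketch only shows where the discrete top-k selection sits in the forward pass, which is the step that standard backpropagation cannot differentiate through.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Pick the two highest-scoring experts for one token.

    Returns (indices, gates): the chosen expert ids and their gate weights,
    renormalized so the two gates sum to 1 (an assumption of this sketch).
    """
    probs = softmax(router_logits)
    # Top-k selection is discrete: the chosen indices do not vary smoothly
    # with the logits, so plain backpropagation gets no gradient through it.
    indices = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    kept = sum(probs[i] for i in indices)
    gates = [probs[i] / kept for i in indices]
    return indices, gates


def moe_forward(x, router_logits, experts):
    """Combine the outputs of the two selected experts, weighted by their gates."""
    indices, gates = top2_route(router_logits)
    return sum(g * experts[i](x) for i, g in zip(indices, gates))


# Toy setup: 16 hypothetical "experts", each a scalar function standing in
# for a feed-forward block. Only 2 of the 16 run for any given token.
NUM_EXPERTS = 16
experts = [lambda x, k=k: (k + 1) * x for k in range(NUM_EXPERTS)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
indices, gates = top2_route(logits)
output = moe_forward(1.0, logits, experts)
```

Because only the two selected experts are evaluated, the per-token compute depends on the activated parameters rather than on the total parameter count, which is the sparsity the abstract refers to.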
