GRIN: GRadient-INformed MoE

Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen

TL;DR

This work tackles the training challenges of sparse MoE models by introducing GRIN MoE, which uses sparse gradient estimation (SparseMixer-v2) for expert routing and a parallelism strategy that avoids token dropping. The approach enables top-2 routing over 16 experts without expert parallelism, achieving high training efficiency and strong performance on autoregressive language modeling: with only 6.6B activated parameters, the model outperforms a 7B dense model and matches a 14B dense model trained on the same data. Through extensive evaluations and semi-controlled analyses, GRIN MoE performs robustly on math and coding tasks, and the analyses show how global load balancing and routing specialization contribute to the gains. The study also outlines practical trade-offs and avenues for scaling MoE further, highlighting both engineering and algorithmic challenges ahead.

Abstract

Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16$\times$3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
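
To make the routing step in the abstract concrete, here is a minimal sketch of generic top-2 routing over 16 experts in plain Python. It is not the GRIN implementation and does not include the SparseMixer-v2 gradient estimator. The toy scalar experts, the softmax gate weights, and the helper names (`top2_route`, `moe_forward`) are all assumptions made for illustration. It only shows why the selection is discrete: the top-k step picks indices, so the choice of experts is piecewise constant in the router logits.

```python
import math
import random

NUM_EXPERTS = 16  # from the abstract: 16 experts
TOP_K = 2         # from the abstract: top-2 routing


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Return [(expert_index, gate_weight), ...] for the TOP_K best experts.

    Assumption: gate weights are the raw softmax probabilities of the
    selected experts (no renormalization); real systems differ here.
    """
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:TOP_K]]


def moe_forward(x, router_logits, experts):
    """Evaluate only the selected experts; the other 14 are never called."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))


# Hypothetical toy experts: expert i multiplies its scalar input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = top2_route(logits)
y = moe_forward(1.0, logits, experts)
```

Small changes to `logits` that do not alter the top-2 ordering leave the selected indices unchanged, which is the property that prevents standard backpropagation from passing a gradient through the selection itself.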

Paper Structure (57 sections, 12 equations, 9 figures, 6 tables, 3 algorithms)

Figures (9)

  • Figure 1: MMLU accuracy and activated parameters.
  • Figure 2: Controlled Comparisons of SparseMixer-v2 and GShard on 16$\times$0.9B MoE.
  • Figure 3: Scaling of Different Parallelism Settings on 64 H100 GPUs. The reported throughput for N experts (x-axis) is the average training throughput of an N$\times$3.8B top-2 MoE.
  • Figure 4: Test Score on Translated 2024 GAOKAO Math-1.
  • Figure 5: Routing distribution on 2 million pretraining tokens. The model on the left is trained with the main recipe and the model on the right with the control recipe. The values are normalized per layer, and the values in each row sum to 1 (perfectly balanced loading would give a value of $0.0625$, i.e. $1/16$, for each of the 16 experts).
  • ...and 4 more figures
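
The normalization described in the Figure 5 caption can be sketched as follows. This is an illustrative reconstruction rather than the authors' analysis code: the function name `routing_distribution` and the input format (a flat list of expert indices, one entry per routed token slot) are assumptions.

```python
from collections import Counter

NUM_EXPERTS = 16  # from the paper: 16 experts per MoE layer


def routing_distribution(assignments, num_experts=NUM_EXPERTS):
    """Fraction of routed token slots sent to each expert in one layer.

    `assignments` is a hypothetical flat list of expert indices, one entry
    per (token, slot) pair; with top-2 routing each token contributes two.
    The returned row sums to 1.
    """
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]


# Perfectly balanced toy input: every expert receives the same load.
balanced = list(range(NUM_EXPERTS)) * 4
row = routing_distribution(balanced)
```

For the balanced input every entry of `row` equals 1/16 = 0.0625, the reference value quoted in the caption. Deviations from that value in a real row indicate experts that receive more or less than their even share of tokens.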