MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Jianbin Chang, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, June Yang

TL;DR

The paper addresses efficient training of large-scale MoE models on thousands of GPUs. It introduces a five-dimensional (5D) hybrid parallelism framework and MoE Parallel Folding, which decouples the parallelism of attention and MoE layers, together with a flexible token dispatcher that supports both token-dropping and token-dropless training. Experiments with Megatron-Core report MFU of up to 49.3% on Mixtral 8x22B and 39.0% on Qwen2-57B-A14B on H100 GPUs, efficient scaling up to 1,024 GPUs, and sequence lengths up to 128K tokens. The code is available in Megatron-Core.

Abstract

Mixture of Experts (MoE) models enhance neural network scalability by dynamically selecting relevant experts per input token, enabling larger model sizes while maintaining manageable computation costs. However, efficient training of large-scale MoE models across thousands of GPUs presents significant challenges due to limitations in existing parallelism strategies. We introduce an end-to-end training framework for large-scale MoE models that utilizes five-dimensional hybrid parallelism: Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallelism, and Pipeline Parallelism. Central to our approach is MoE Parallel Folding, a novel strategy that decouples the parallelization of attention and MoE layers in Transformer models, allowing each layer type to adopt optimal parallel configurations. Additionally, we develop a flexible token-level dispatcher that supports both token-dropping and token-dropless MoE training across all five dimensions of parallelism. This dispatcher accommodates dynamic tensor shapes and coordinates different parallelism schemes for attention and MoE layers, facilitating complex parallelism implementations. Our experiments demonstrate significant improvements in training efficiency and scalability. We achieve up to 49.3% Model FLOPs Utilization (MFU) for the Mixtral 8x22B model and 39.0% MFU for the Qwen2-57B-A14B model on H100 GPUs, outperforming existing methods. The framework scales efficiently up to 1,024 GPUs and maintains high performance with sequence lengths up to 128K tokens, validating its effectiveness for large-scale MoE model training. The code is available in Megatron-Core.
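
To make the idea of decoupled mappings concrete, here is a minimal sketch, not the Megatron-Core implementation. It assumes attention layers are laid out over tensor, context, data, and pipeline parallel sizes, while MoE layers use their own expert-side sizes on the same set of GPUs. The class names, the field names `etp` and `edp`, and the rule that both layer types share one pipeline size are illustrative assumptions, not taken from the abstract.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class AttentionMapping:
    """Parallel sizes used by attention layers (illustrative)."""
    tp: int  # tensor parallel size
    cp: int  # context parallel size
    dp: int  # data parallel size
    pp: int  # pipeline parallel size

    def world_size(self) -> int:
        return prod((self.tp, self.cp, self.dp, self.pp))


@dataclass(frozen=True)
class MoEMapping:
    """Parallel sizes used by MoE layers (field names are hypothetical)."""
    etp: int  # tensor parallel size inside each expert (assumed name)
    ep: int   # expert parallel size
    edp: int  # data parallel size for expert weights (assumed name)
    pp: int   # pipeline parallel size

    def world_size(self) -> int:
        return prod((self.etp, self.ep, self.edp, self.pp))


def is_consistent(attn: AttentionMapping, moe: MoEMapping, num_gpus: int) -> bool:
    """Check that both mappings tile the same GPUs.

    Assumption of this sketch: the two layer types may factor the GPUs
    differently, but each factorization must cover exactly `num_gpus`
    and both must agree on the pipeline size.
    """
    return (
        attn.pp == moe.pp
        and attn.world_size() == num_gpus
        and moe.world_size() == num_gpus
    )


# Example: 128 GPUs, attention uses TP=4 and CP=2, MoE uses EP=8 without expert TP.
attn = AttentionMapping(tp=4, cp=2, dp=8, pp=2)
moe = MoEMapping(etp=1, ep=8, edp=8, pp=2)
print(is_consistent(attn, moe, num_gpus=128))
```

The point of the sketch is only that the two factorizations are chosen independently: the attention side can spend GPUs on tensor and context parallelism while the MoE side spends the same GPUs on expert parallelism.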

Paper Structure

This paper contains 27 sections, 4 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Illustration of parallelism mappings with MoE Parallel Folding.
  • Figure 2: Workflow of token dispatcher with Tensor Parallelism and Expert Parallelism.
  • Figure 3: Strong-scaling experiments for various parallelism strategies, increasing the number of GPUs up to 1,024.
  • Figure 4: Context-scaling experiments, increasing the context length up to 128K tokens and the number of GPUs up to 1,024.
  • Figure 5: MoE layer breakdown with different parallelism mappings. The marker * denotes new parallelism mappings supported by MoE Parallel Folding.
  • ...and 4 more figures