MicroMoE: Fine-Grained Load Balancing for Mixture-of-Experts with Token Scheduling

Chenqi Zhao, Wenfei Wu, Linhai Song, Yuchen Xu

TL;DR

MoE training suffers from dynamic load imbalance across GPUs due to token routing, which causes inefficiency in expert-parallel distributed setups. The authors propose MicroEP, a token-scheduling-driven expert-parallelism strategy that balances load within every micro-batch: it solves a linear program to determine the load of each expert replica and routes tokens accordingly. This is complemented by strategic expert placement (symmetric and asymmetric) and adaptive replacement. Built on MicroEP, MicroMoE is a distributed MoE training system that improves throughput by up to 47.6% over Megatron-LM while maintaining near-perfect load balance across GPUs under varied workloads. The approach adds minimal extra communication, overlaps scheduling with computation, and provides scalable, fine-grained load balancing for large MoE models in real-world training environments.

Abstract

Mixture-of-Experts (MoE) has emerged as a promising approach to scale up deep learning models due to its significant reduction in computational resources. However, the dynamic nature of MoE leads to load imbalance among experts, severely impacting training efficiency. While previous research has attempted to address the load balancing challenge, existing solutions either compromise model accuracy or introduce additional system overhead. As a result, they fail to achieve fine-grained load balancing, which is crucial to optimizing training efficiency. We propose MicroEP, a novel parallelization strategy to achieve fine-grained load balancing in MoE systems. MicroEP is capable of achieving optimal load balancing in every micro-batch through efficient token scheduling across GPUs. Furthermore, we propose MicroMoE, an efficient distributed MoE training system with MicroEP's load balancing capabilities. Our experimental results demonstrate that MicroMoE improves the end-to-end training throughput by up to 47.6% compared with the state-of-the-art system, and almost consistently achieves optimal load balance among GPUs.
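
To make the per-micro-batch balancing idea concrete, here is a minimal sketch of the underlying min-max problem: given the token count routed to each expert and the set of GPUs holding a replica of that expert, split each expert's tokens across its replicas so that the busiest GPU carries as few tokens as possible. The summary says MicroEP solves a linear program for this; the sketch below is not the authors' solver. It swaps in a binary search over a max-flow feasibility check, which reaches the same min-max objective for integer token counts. The function names, the ring-style placement, and the example loads are all hypothetical and chosen only for illustration.

```python
from collections import deque


def try_cap(loads, placement, num_gpus, cap):
    """Return {(expert, gpu): tokens} keeping every GPU at or below `cap`, or None.

    Hypothetical helper: models the split as a flow network
    source -> expert -> GPU (replica) -> sink and runs Edmonds-Karp.
    """
    num_experts = len(loads)
    source, sink = 0, 1 + num_experts + num_gpus
    size = sink + 1
    residual = [[0] * size for _ in range(size)]
    for e, load in enumerate(loads):
        residual[source][1 + e] = load  # tokens routed to expert e
        for g in placement[e]:
            residual[1 + e][1 + num_experts + g] = load  # replica of e on GPU g
    for g in range(num_gpus):
        residual[1 + num_experts + g][sink] = cap  # per-GPU token budget

    routed = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = [-1] * size
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(size):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            break
        path = []
        v = sink
        while v != source:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        routed += push

    if routed < sum(loads):
        return None  # some tokens cannot be placed under this budget
    # Flow on an expert -> GPU edge equals the residual on the reverse edge.
    return {(e, g): residual[1 + num_experts + g][1 + e]
            for e in range(num_experts) for g in placement[e]}


def balance_micro_batch(loads, placement, num_gpus):
    """Binary-search the smallest feasible per-GPU budget.

    Assumes every expert has at least one replica.
    """
    lo, hi = 0, sum(loads)
    while lo < hi:
        mid = (lo + hi) // 2
        if try_cap(loads, placement, num_gpus, mid) is not None:
            hi = mid
        else:
            lo = mid + 1
    return lo, try_cap(loads, placement, num_gpus, lo)


# Hypothetical micro-batch: 4 experts, 4 GPUs, expert e replicated on
# GPUs e and (e + 1) % 4. Expert 0 is the hot expert.
loads = [7, 3, 3, 3]
placement = [[0, 1], [1, 2], [2, 3], [3, 0]]
max_load, plan = balance_micro_batch(loads, placement, num_gpus=4)
print("busiest GPU load:", max_load)
for (expert, gpu), tokens in sorted(plan.items()):
    print(f"expert {expert} -> GPU {gpu}: {tokens} tokens")
```

Without replicas, the GPU hosting expert 0 would process all 7 of its tokens while the others process 3 each. With two replicas per expert, the same 16 tokens can be spread evenly over the four GPUs. Placements in which a hot expert's replicas all sit on the same small group of GPUs can still prevent a perfectly even split.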

Paper Structure

This paper contains 41 sections, 4 equations, 16 figures, 2 tables, 1 algorithm.

Figures (16)

  • Figure 1: An example of transformer, MoE, and expert parallelism.
  • Figure 2: Expert load distribution of GPT 32$\times$1.3B layer 20 in some training iterations.
  • Figure 3: Converting EP to MicroEP. The shape and color of a symbol indicate the source GPU and the assigned expert of a token. The bottom curves indicate EDP groups.
  • Figure 4: MicroMoE architecture.
  • Figure 5: The graph abstraction of an example expert placement. Color bars indicate edges (experts) between vertices (GPUs). $G_{max}$ contains GPU 0, 3. Expert 0 is entirely in $G_{max}$. Experts 1,3 partially intersect with $G_{max}$ and cannot distribute any load within $G_{max}$.
  • ...and 11 more figures

Theorems & Definitions (1)

  • proof