FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models

Xinglin Pan, Wenxiang Lin, Lin Zhang, Shaohuai Shi, Zhenheng Tang, Rui Wang, Bo Li, Xiaowen Chu

TL;DR

FSMoE is a flexible, modular training system for sparse Mixture-of-Experts (MoE) models. It combines unified abstractions of MoE modules with online profiling to enable near-optimal task scheduling across DP, MP, EP, and ESP parallelisms. It introduces an adaptive gradient partitioning method that overlaps gradient aggregation with computation, and a pipeline-degree optimization that coordinates inter-node and intra-node communications with computations. Key contributions include modular MoE components (Gate, Order, I-Order, Dispatch, Combine, Expert), a generic front-end/back-end scheduler, and performance models that guide scheduling decisions via SLSQP optimization. In experiments on two GPU clusters, FSMoE achieves up to a 1.42x speedup over existing implementations of four popular routing functions, and outperforms state-of-the-art MoE training systems (DeepSpeed-MoE and Tutel) by 1.18x–1.22x on 1458 configured MoE layers and by 1.19x–3.01x on real-world models based on GPT-2 and Mixtral.

Abstract

Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules (token routing, token communication, expert computation, and expert parallelism) that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques: 1) Unified abstraction and online profiling of MoE modules for task scheduling across various MoE implementations. 2) Co-scheduling intra-node and inter-node communications with computations to minimize communication overheads. 3) To support near-optimal task scheduling, we design an adaptive gradient partitioning method for gradient aggregation and a schedule to adaptively pipeline communications and computations. We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters. Experimental results show that 1) our FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations (with up to a 1.42$\times$ speedup), and 2) FSMoE outperforms the state-of-the-art MoE training systems (DeepSpeed-MoE and Tutel) by 1.18$\times$-1.22$\times$ on 1458 MoE layers and 1.19$\times$-3.01$\times$ on real-world MoE models based on GPT-2 and Mixtral using a popular routing function.
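As a rough illustration of what choosing a pipeline degree involves, the minimal sketch below picks the degree $r$ that minimizes the modeled time of a two-stage pipeline (communication, then expert computation) under a linear cost model $t(n) = \alpha + \beta n$. Everything in it is an assumption made for illustration: the function names, the two-stage simplification, the cost coefficients, and the exhaustive search over integer $r$, which stands in for the SLSQP-based optimization over FSMoE's own performance models.

```python
from typing import Tuple

# (alpha, beta): startup cost and per-token cost of one stage (hypothetical units).
Cost = Tuple[float, float]


def stage_time(cost: Cost, n_tokens: float) -> float:
    """Linear cost model t(n) = alpha + beta * n for a single stage."""
    alpha, beta = cost
    return alpha + beta * n_tokens


def pipelined_time(r: int, n_tokens: float, comm: Cost, comp: Cost) -> float:
    """Modeled time of a two-stage pipeline over r equal chunks.

    The first chunk pays both stages in sequence; each of the remaining
    r - 1 chunks adds only the slower of the two stages.
    """
    chunk = n_tokens / r
    t_comm = stage_time(comm, chunk)
    t_comp = stage_time(comp, chunk)
    return t_comm + t_comp + (r - 1) * max(t_comm, t_comp)


def best_pipeline_degree(n_tokens: float, comm: Cost, comp: Cost, max_r: int = 16) -> int:
    """Exhaustively search integer degrees 1..max_r (a stand-in for SLSQP)."""
    return min(range(1, max_r + 1),
               key=lambda r: pipelined_time(r, n_tokens, comm, comp))


if __name__ == "__main__":
    # With alpha = beta = 1 for both stages and 16 tokens, the modeled time is
    # 17 + r + 16 / r, which is smallest at r = 4.
    print(best_pipeline_degree(16, (1.0, 1.0), (1.0, 1.0)))  # → 4
```

The sketch shows the trade-off such a model captures: a larger $r$ exposes more overlap, but every extra chunk pays the startup cost $\alpha$ again, so the best degree depends on the profiled coefficients rather than being a fixed constant.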

Paper Structure

This paper contains 25 sections, 16 equations, 12 figures, 6 tables, and 1 algorithm.

Figures (12)

  • Figure 1: A typical MoE structure with $E$ experts.
  • Figure 2: An example of $N_{\text{DP}}=N_{\text{MP}}=N_{\text{EP}}=N_{\text{ESP}}=2$. The attention is partitioned into two parts across MP groups, and the two experts are distributed to the two EP groups (GPU1 and GPU3, as well as GPU2 and GPU4) in EP, and each expert is further partitioned into two shards across the ESP group. The blue and green rectangles indicate the data tensors.
  • Figure 3: Backpropagation of four schedules in DP+MP+EP+ESP with the pipeline degree $r=4$ including (a) the default schedule, (b) an improved Tutel version (Tutel-Improved) where Gradient-AllReduce is overlapped with other dense operations using PipeMoE, (c) our proposed schedule FSMoE without partitioning the gradient, and (d) our proposed schedule FSMoE. The forward process is similar to the backpropagation except for the absence of the Gradient-AllReduce.
  • Figure 4: Four cases when scheduling the pipelining of ESP-AllGather/ESP-ReduceScatter, AlltoAll Dispatch/Combine, expert computations, and Gradient-AllReduce with the pipeline degree $r=2$. (a) Case 1: The AlltoAll communications are slower than intra-node communications and expert computations, but the inter-node communications (AlltoAll and Gradient-AllReduce) are not slower than intra-node communications and expert computations. (b) Case 2: Expert computations are not slower than inter-node communications and intra-node communications. (c) Case 3: The AlltoAll communications are not slower than intra-node communications and expert computations. (d) Case 4: The intra-node communications (AllGather and ReduceScatter) are not slower than inter-node communications and expert computations.
  • Figure 6: Speedups of FSMoE, FSMoE-No-IIO, Tutel, Tutel-Improved, and PipeMoE+Lina (PipeMoE with the additional schedule introduced by Lina [li2023lina], which partitions the gradient into fixed-size chunks) over DeepSpeed-MoE (DS-MoE) on MoE models (GPT2-XL, Mixtral-7B, and Mixtral-22B).
  • ...and 7 more figures
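
The idea of partitioning the gradient so that Gradient-AllReduce overlaps with other work, as contrasted in the Figure 3 schedules, can be pictured with a small sketch. The code below is not FSMoE's algorithm. It is a hypothetical greedy heuristic that sizes each gradient chunk to fit an idle window on the inter-node link, where the window lengths, the startup latency `alpha_us`, and the bandwidth `bytes_per_us` are all assumed inputs that a profiler would have to supply.

```python
from typing import List, Tuple


def partition_gradient(total_bytes: int,
                       idle_windows_us: List[int],
                       alpha_us: int,
                       bytes_per_us: int) -> Tuple[List[int], int]:
    """Greedily size one gradient chunk per idle communication window.

    A window of length w can hide at most (w - alpha_us) * bytes_per_us bytes,
    because every chunk pays the startup latency alpha_us once.
    Returns the per-window chunk sizes and the bytes left over, which would
    have to be aggregated without overlap.
    """
    chunks: List[int] = []
    remaining = total_bytes
    for window in idle_windows_us:
        capacity = max(0, window - alpha_us) * bytes_per_us
        size = min(remaining, capacity)  # 0 if the window is too short or nothing is left
        chunks.append(size)
        remaining -= size
    return chunks, remaining


if __name__ == "__main__":
    # Three idle windows of 15, 5, and 35 microseconds; the second is too
    # short to cover the 5-microsecond startup latency, so it gets no data.
    print(partition_gradient(1000, [15, 5, 35], alpha_us=5, bytes_per_us=20))
    # → ([200, 0, 600], 200)
```

Unlike a fixed chunk size, the chunk sizes here vary with the available windows, which is the intuition behind calling the partitioning adaptive. The actual system derives its partition from its performance models rather than from this greedy rule.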