Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules

Xinglin Pan, Wenxiang Lin, Shaohuai Shi, Xiaowen Chu, Weinong Sun, Bo Li

TL;DR

Parm tackles the dominant communication cost in training large sparsely activated MoE models under MP+EP+ESP by introducing two dedicated schedules, S1 and S2, that pause MP to remove redundant work and replace multiple collectives with a unified one, enabling overlap between intra-node and inter-node communications. The authors formalize time costs with an alpha-beta performance model and devise an automatic selection mechanism to pick the best schedule per configuration. Through experiments on 8-GPU and 32-GPU setups and real-world BERT-Base and GPT-2 MoE variants, Parm achieves up to 5.77× speedups over DeepSpeed-MoE and ~3× on real models, demonstrating practical gains for scalable MoE training on commodity interconnects. By integrating the two schedules with an online decision policy, Parm delivers robust performance improvements without requiring advanced hardware, enhancing the viability of ever-larger sparsely-activated foundation models.

Abstract

Sparsely-activated Mixture-of-Expert (MoE) layers have found practical applications in enlarging the model size of large-scale foundation models, with only a sub-linear increase in computation demands. Despite the wide adoption of hybrid parallel paradigms like model parallelism, expert parallelism, and expert-sharding parallelism (i.e., MP+EP+ESP) to support MoE model training on GPU clusters, the training efficiency is hindered by communication costs introduced by these parallel paradigms. To address this limitation, we propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks. The proposed schedules eliminate redundant computations and communications and enable overlaps between intra-node and inter-node communications, ultimately reducing the overall training time. As the two schedules are not mutually exclusive, we provide comprehensive theoretical analyses and derive an automatic and accurate solution to determine which schedule should be applied in different scenarios. Experimental results on an 8-GPU server and a 32-GPU cluster demonstrate that Parm outperforms the state-of-the-art MoE training system, DeepSpeed-MoE, achieving 1.13$\times$ to 5.77$\times$ speedup on 1296 manually configured MoE layers and approximately 3$\times$ improvement on two real-world MoE models based on BERT and GPT-2.
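
To make the idea of model-driven schedule selection concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard alpha-beta model, in which sending $n$ bytes over a link costs $t(n) = \alpha + \beta n$, and it assumes each candidate schedule can be summarized as a list of (link, message size) steps whose costs add up. The names `Link`, `transfer_time`, `schedule_time`, and `pick_schedule`, along with every numeric value, are hypothetical placeholders; the paper's actual cost equations for $S_1$ and $S_2$ are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Alpha-beta parameters of one interconnect (hypothetical values below)."""
    alpha: float  # startup latency per message, in seconds
    beta: float   # transfer time per byte, in seconds per byte


def transfer_time(link: Link, nbytes: float) -> float:
    """Alpha-beta cost of moving nbytes over a link: alpha + beta * n."""
    return link.alpha + link.beta * nbytes


def schedule_time(steps) -> float:
    """Predicted time of a schedule given as (link, nbytes) steps.

    Assumption: steps run back to back, so their costs simply add up.
    A real model would also account for overlapped steps.
    """
    return sum(transfer_time(link, nbytes) for link, nbytes in steps)


def pick_schedule(candidates) -> str:
    """Return the name of the candidate with the smallest predicted time."""
    return min(candidates, key=lambda name: schedule_time(candidates[name]))


# Illustrative, made-up parameters: a fast intra-node link, a slower inter-node link.
intra = Link(alpha=5e-6, beta=1e-10)
inter = Link(alpha=2e-5, beta=1e-9)

# Placeholder step lists; they do NOT reflect the real S1/S2 communication volumes.
candidates = {
    "S1": [(intra, 4e6), (inter, 8e6)],
    "S2": [(intra, 8e6), (inter, 4e6)],
}

best = pick_schedule(candidates)
print(best, {name: schedule_time(steps) for name, steps in candidates.items()})
```

With these placeholder numbers the candidate that sends fewer bytes over the slower inter-node link is predicted to be faster, which is the kind of trade-off a per-configuration selector has to evaluate.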

Paper Structure

This paper contains 22 sections, 14 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: The communication time ratio varies across different configurations, ranging from 67.92% to 96.02% when using 32 Nvidia GeForce RTX 2080Ti GPUs. Detailed configurations for test cases can be found in Table \ref{tab:moe-configs}.
  • Figure 2: An example of $N_{\text{MP}}=N_{\text{EP}}=N_{\text{ESP}}=2$. The two experts ($\text{E}_1$ and $\text{E}_2$) are distributed to the two EP groups, and each expert is further partitioned into two shards across the ESP group; that is, $\text{E}_1\rightarrow[\text{E}_1^1, \text{E}_1^2]$ are distributed to GPU 0 and GPU 2, respectively, and $\text{E}_2\rightarrow[\text{E}_2^1, \text{E}_2^2]$ are distributed to GPU 2 and GPU 4, respectively. The blue and green rectangles indicate the data tensors, and a partially colored rectangle represents a partial sum.
  • Figure 3: Three schedules in MP+EP+ESP: (a) the default schedule, (b) our proposed $S_1$ schedule, and (c) our proposed $S_2$ schedule. Yellow indicates input or output data, blue indicates communication operations, and green indicates computation operations. Note that the split operations have no communication workload in feed-forward propagation, but they introduce AllGather communication in backpropagation. Two overlapping blue rectangles indicate that the two operations can be executed in parallel.
  • Figure 4: Examples of communication patterns with $N_\text{EP}=N_\text{ESP}=2$ under different schedules. Solid arrows indicate transfers that require communication; hollow arrows indicate that no communication is required.
  • Figure 5: Example of SAA (4-way AlltoAll and 2-way AllGather). Red slices represent data received from AlltoAll in the current turn, and blue slices represent data received from AllGather. Blue arrows represent data transfers using AllGather.
  • ...and 2 more figures