Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules
Xinglin Pan, Wenxiang Lin, Shaohuai Shi, Xiaowen Chu, Weinong Sun, Bo Li
TL;DR
Parm targets the communication cost that dominates the training of large sparsely-activated MoE models under MP+EP+ESP. It introduces two dedicated schedules, S1 and S2, that pause MP to remove redundant work, replace multiple collectives with a single unified one, and overlap intra-node with inter-node communication. The authors formalize time costs with an alpha-beta performance model and devise an automatic mechanism that selects the better schedule for each configuration. In experiments on 8-GPU and 32-GPU setups, Parm achieves up to 5.77× speedup over DeepSpeed-MoE on configured MoE layers and roughly 3× on real-world BERT-Base and GPT-2 MoE variants. Because the gains come from scheduling rather than from advanced hardware, Parm makes training ever-larger sparsely-activated foundation models more practical on commodity interconnects.
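For readers unfamiliar with the alpha-beta model mentioned above, the following is a minimal sketch of the general idea, not the paper's actual cost formulas. It assumes the standard form, in which sending a message of n bytes costs alpha + beta * n (startup latency plus per-byte transfer time), and it assumes that schedule selection amounts to choosing the schedule with the smaller predicted time. The names `AlphaBeta` and `pick_schedule` and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AlphaBeta:
    """Standard alpha-beta link model (parameters here are placeholders)."""
    alpha: float  # startup latency per message, in seconds
    beta: float   # transfer time per byte, in seconds per byte

    def time(self, nbytes: float) -> float:
        # Predicted time to move one message of `nbytes` bytes over this link.
        return self.alpha + self.beta * nbytes


def pick_schedule(predicted: Mapping[str, float]) -> str:
    # Assumption: the selector simply returns the schedule whose
    # predicted iteration time is smallest.
    return min(predicted, key=predicted.get)


# Hypothetical link parameters, not measurements from the paper.
inter_node = AlphaBeta(alpha=1e-5, beta=1e-9)
one_megabyte_cost = inter_node.time(1e6)

# Hypothetical predicted times (seconds) for the two schedules.
choice = pick_schedule({"S1": 2.0, "S2": 1.5})
```

In practice alpha and beta would be fitted separately for intra-node and inter-node links from measured transfer times, and the predicted time of each schedule would be assembled from the costs of the collectives it issues.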
Abstract
Sparsely-activated Mixture-of-Experts (MoE) layers have found practical applications in enlarging the model size of large-scale foundation models, with only a sub-linear increase in computation demands. Despite the wide adoption of hybrid parallel paradigms like model parallelism, expert parallelism, and expert-sharding parallelism (i.e., MP+EP+ESP) to support MoE model training on GPU clusters, the training efficiency is hindered by communication costs introduced by these parallel paradigms. To address this limitation, we propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks. The proposed schedules eliminate redundant computations and communications and enable overlaps between intra-node and inter-node communications, ultimately reducing the overall training time. As the two schedules are not mutually exclusive, we provide comprehensive theoretical analyses and derive an automatic and accurate solution to determine which schedule should be applied in different scenarios. Experimental results on an 8-GPU server and a 32-GPU cluster demonstrate that Parm outperforms the state-of-the-art MoE training system, DeepSpeed-MoE, achieving 1.13$\times$ to 5.77$\times$ speedups on 1296 manually configured MoE layers and approximately 3$\times$ improvement on two real-world MoE models based on BERT and GPT-2.
