BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference

Zewen Jin, Shengnan Wang, Jiaan Zhu, Hongrui Zhan, Youhui Bai, Lin Zhang, Zhenyu Ming, Cheng Li

TL;DR

This paper tackles the All-to-All communication bottleneck in fine-grained Mixture-of-Experts (MoE) models by introducing BigMac, a Descend-Communicate-Communicate-Ascend (DCCA) structure that performs MoE exchanges at a reduced dimensionality. It adds descending and ascending projections around the experts and redesigns each small expert to maintain capacity, so that routing happens on a downscaled representation while model quality is preserved with a modest overhead. Empirical results show that BigMac converges as fast as or faster than conventional MoEs and delivers up to 3.09× training speedups and up to 3.11× inference throughput improvements across Megatron, Tutel, and DeepSpeed-Inference, with robust performance on downstream tasks. The work demonstrates that algorithmic adjustments to MoE communication can substantially boost efficiency for large-scale LLMs without sacrificing accuracy or requiring drastic system redesigns.

Abstract

The Mixture-of-Experts (MoE) structure scales Transformer-based large language models (LLMs) and improves their performance with only a sub-linear increase in computation resources. Recently, the fine-grained DeepSeekMoE structure was proposed, which further improves the computing efficiency of MoE without performance degradation. However, the All-to-All communication introduced by MoE has become a bottleneck, especially for the fine-grained structure, which typically involves and activates more experts and therefore incurs heavier communication overhead. In this paper, we propose a novel MoE structure named BigMac, which is also fine-grained but communication-efficient. The key innovation of BigMac is that it abandons the communicate-descend-ascend-communicate (CDAC) manner used by fine-grained MoE, in which the All-to-All communication always takes place at the highest dimension. Instead, BigMac adopts an efficient descend-communicate-communicate-ascend (DCCA) manner. Specifically, we add a descending projection at the entrance of the expert and an ascending projection at its exit, which allows the communication to be performed at a very low dimension. Furthermore, to adapt to DCCA, we re-design the structure of the small experts, ensuring that each expert in BigMac has enough complexity to process tokens. Experimental results show that BigMac achieves comparable or even better model quality than fine-grained MoEs with the same number of experts and a similar number of total parameters. Equally importantly, BigMac reduces the end-to-end latency by up to 3.09× for training and increases the throughput by up to 3.11× for inference on state-of-the-art AI computing frameworks including Megatron, Tutel, and DeepSpeed-Inference.
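To make the CDAC versus DCCA contrast in the abstract concrete, the following minimal sketch estimates the All-to-All payload of one MoE layer when tokens are exchanged at the full hidden dimension versus a reduced dimension. The function name `all_to_all_bytes` and all sizes (`hidden_dim`, `low_dim`, `num_tokens`, `top_k`, 2 bytes per element) are hypothetical values chosen for illustration, not figures from the paper. The sketch also assumes each token is sent once per selected expert and counts payload only, so it ignores latency, load imbalance, and compute.

```python
def all_to_all_bytes(num_tokens, top_k, comm_dim, bytes_per_elem=2):
    """Rough payload of one MoE layer: a dispatch plus a combine All-to-All.

    Assumption (not from the paper): every token is sent once to each of
    its top_k experts, and results come back at the same dimension.
    """
    per_direction = num_tokens * top_k * comm_dim * bytes_per_elem
    return 2 * per_direction  # dispatch + combine


# Hypothetical sizes, chosen only for illustration.
hidden_dim = 4096   # dimension used for communication under CDAC
low_dim = 512       # reduced dimension used for communication under DCCA
num_tokens = 8192
top_k = 8

cdac_bytes = all_to_all_bytes(num_tokens, top_k, hidden_dim)
dcca_bytes = all_to_all_bytes(num_tokens, top_k, low_dim)

print(f"CDAC payload: {cdac_bytes / 2**20:.0f} MiB")
print(f"DCCA payload: {dcca_bytes / 2**20:.0f} MiB")
print(f"payload reduction: {cdac_bytes / dcca_bytes:.1f}x")
```

Under these assumptions the payload shrinks by the ratio `hidden_dim / low_dim`. This is only a payload estimate: the end-to-end gains reported in the abstract (up to 3.09× for training) also depend on computation and on the framework, so the two numbers should not be expected to match.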

Paper Structure

This paper contains 19 sections, 3 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Convergence result comparison of MoE models with three structures. GPT-Fine-Grained takes 38.9 hours to reach the target perplexity of 13.69, while GPT-BigMac takes only 22.8 hours (1.7$\times$ faster). GPT-Vanilla fails to converge to the target perplexity within the time budget.
  • Figure 2: The MoE layers of different structures. Here, N represents the number of experts in the Vanilla MoE model and ACT represents an activation function such as ReLU. $W_{\downarrow}$ and $W_{\uparrow}$ represent the descending and ascending projection matrices of an expert, respectively. $W_{\downarrow}'$ and $W_{\uparrow}'$ represent the projection matrices introduced in BigMac.
  • Figure 3: Per-iteration training time comparison between the fine-grained structure and BigMac on Megatron. The models are constructed from four base models, namely GPT3-Medium, GPT3-XL, GPT3-2.7B, and GPT3-6.7B, ordered by the size of parameters.
  • Figure 4: Training time breakdown under different parallelism settings on Megatron. The labels ($ep$, $tp$) represent expert parallelism degree and tensor parallelism degree, respectively. For each group, the left bar is the result of GPT-Fine-Grained, and the right bar corresponds to GPT-BigMac. The numbers displayed on the right bar indicate the speedup in end-to-end latency.
  • Figure 5: Inference throughput comparison between GPT-Fine-Grained and GPT-BigMac on Megatron. We conduct experiments with different numbers of GPUs, expert parallelism degrees $ep$, and $top\_k$ values. The numbers under the x-axis represent different prompt lengths.
  • ...and 2 more figures