Hexa-MoE: Efficient and Heterogeneous-aware Training for Mixture-of-Experts

Shuqing Luo, Jie Peng, Pingzhi Li, Hanrui Wang, Tianlong Chen

TL;DR

MoE models scale parameter count efficiently, but current expert-parallel implementations depend on homogeneous devices and incur heavy memory, communication, and redundant-computation overhead. HEXA-MoE introduces expert-specific operators (ESMM, ESS, ESTMM) that compute MoE layers in place, plus a pipeline-shared cache that cuts memory use in the data-centric setting; data- and model-centric configurations let it adapt to different workload scales and to heterogeneous hardware. On Swin-Transformer-MoE benchmarks it reduces memory consumption by 10–48% and achieves 0.5–4.3× speedups over state-of-the-art MoE libraries, and choosing a suitable parallel configuration on heterogeneous devices further reduces overall latency. Together, these advances broaden practical MoE deployment by lowering resource demands and exploiting diverse computing resources.

Abstract

Mixture-of-Experts (MoE) has emerged as a practical approach to scaling up the parameters of Transformer models for better generalization while keeping the increase in computation overhead sub-linear. Current MoE models are mainly built with expert parallelism on distributed devices. However, this approach usually depends on homogeneous devices for deployment and suffers from heavy communication overhead and computation redundancy. In this paper, we develop a \texttt{H}eterogeneous-aware \texttt{EX}pert \texttt{A}llocation framework, \textbf{\texttt{HEXA-MoE}}, with significantly enhanced computing efficiency. It contains two components: ($1$) \textit{Expert-Specific Operators}. We replace the typical general matrix multiplication or grouped matrix multiplication interfaces with our operators, which allow the computation to be performed in place with \textbf{ZERO} redundancy. ($2$) \textit{Adaptive Data- and Model-Centric Configurations} for different workload scales. Specifically, we introduce a pipeline-shared cache on each device to tackle the heavy memory consumption of the existing data-centric MoE library. Comprehensive experiments on the Swin-MoE benchmark consistently demonstrate the effectiveness of our \texttt{HEXA-MoE} framework, i.e., it reduces memory consumption by $10\%\sim48\%$ and achieves a $0.5\sim4.3\times$ speedup compared to current state-of-the-art MoE libraries. Furthermore, we examine \texttt{HEXA-MoE} on heterogeneous devices in both data- and model-centric settings. Promising results show that employing the optimal parallel configuration with \texttt{HEXA-MoE} on heterogeneous devices can substantially minimize overall latency. Code is available at https://github.com/UNITES-Lab/HEXA-MoE.
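To make the idea of an expert-specific matrix multiplication concrete, the following is a minimal NumPy sketch under stated assumptions, not the HEXA-MoE kernel itself. It assumes top-1 routing and computes each token's output with the weight of its routed expert, writing results directly into the output rows rather than first copying tokens into per-expert dispatch buffers. The function name `esmm_top1`, the tensor shapes, and the random routing are all hypothetical choices made for illustration.

```python
import numpy as np

def esmm_top1(x, w, expert_idx):
    """Illustrative expert-specific matmul: y[i] = x[i] @ w[expert_idx[i]].

    x:          (num_tokens, d_in) token features
    w:          (num_experts, d_in, d_out) one weight matrix per expert
    expert_idx: (num_tokens,) routed expert id per token (top-1 routing assumed)
    """
    num_tokens = x.shape[0]
    num_experts, _, d_out = w.shape
    y = np.empty((num_tokens, d_out), dtype=x.dtype)
    for e in range(num_experts):
        # Rows routed to expert e; results are written straight back
        # to those same rows of y, so no padded dispatch buffer is built.
        rows = np.flatnonzero(expert_idx == e)
        if rows.size:
            y[rows] = x[rows] @ w[e]
    return y

# Hypothetical sizes: 10 tokens, 4 experts, 8 -> 16 features.
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8))
w = rng.standard_normal((4, 8, 16))
expert_idx = rng.integers(0, 4, size=10)
y = esmm_top1(x, w, expert_idx)
```

Each token is multiplied exactly once by exactly one expert weight, which is the sense in which this sketch avoids redundant computation; a real GPU operator would fuse the gather, multiply, and scatter into a single kernel.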

Paper Structure

This paper contains 36 sections, 2 equations, 10 figures, 8 tables, 5 algorithms.

Figures (10)

  • Figure 1: Convergence Analysis. HEXA-MoE can significantly surpass Tutel on MoE training due to the specialized designs.
  • Figure 2: Comparison between conventional and expert-specific formulation for MoE computing. We take top-1 routing for illustration and present the corresponding relation of each formula in the MoE forward and backward propagation.
  • Figure 3: Illustration of the proposed operators. We take $10$ tokens, $4$ global experts, and tiling size $4$ as an example. For ESTMM, the $2$ input batches are in a re-indexed format, while for others both the raw batch and re-index vector are provided.
  • Figure 4: Visualization of the training pipeline and shared cache in the data-centric setting. Each device copies the parameter shard it holds to the cache before the all-gather communication, after which it can access all parameters of an MoE layer.
  • Figure 5: Average memory usage for training Swin-Transformer-MoE models. We use 8 global experts and examine all cases from top-1 to top-8 routing. Experiments are conducted on 2 homogeneous GPUs using automatic mixed precision in PyTorch. The batch size is set to 40 for all cases. We record the average GPU memory consumption (GB) on each device.
  • ...and 5 more figures
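The Figure 3 caption mentions a re-index vector and a tiling size. Since the figure itself is not reproduced here, the sketch below shows only one plausible layout, assumed for illustration: token ids are grouped by expert, and each group is padded with `-1` to a multiple of the tile size so that every tile touches a single expert. The helper names `build_reindex` and `tiled_esmm`, the `-1` padding convention, and the example routing are hypothetical and may differ from the paper's actual operators.

```python
import numpy as np

def build_reindex(expert_idx, num_experts, tile):
    """Group token ids by expert; pad each group with -1 to a multiple of `tile`.

    The padding convention is an assumption made for this sketch.
    """
    parts = []
    for e in range(num_experts):
        ids = np.flatnonzero(expert_idx == e)
        pad = (-len(ids)) % tile  # slots needed to fill the last tile
        parts.append(np.concatenate([ids, np.full(pad, -1, dtype=ids.dtype)]))
    return np.concatenate(parts)

def tiled_esmm(x, w, expert_idx, reindex, tile):
    """Process one tile at a time; every tile uses a single expert weight."""
    y = np.zeros((x.shape[0], w.shape[2]), dtype=x.dtype)
    for start in range(0, len(reindex), tile):
        ids = reindex[start:start + tile]
        ids = ids[ids >= 0]      # drop padding slots
        e = expert_idx[ids[0]]   # all valid ids in a tile share this expert
        y[ids] = x[ids] @ w[e]   # write results back to the original rows
    return y

# Sizes from the Figure 3 caption: 10 tokens, 4 global experts, tiling size 4.
# The routing below is made up for the example.
expert_idx = np.array([0, 2, 1, 0, 3, 2, 0, 1, 0, 0])
reindex = build_reindex(expert_idx, num_experts=4, tile=4)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8))
w = rng.standard_normal((4, 8, 16))
y = tiled_esmm(x, w, expert_idx, reindex, tile=4)
```

With this routing, expert 0 receives five tokens and occupies two tiles, while the other experts occupy one tile each. Only the index vector is padded; the token batch stays in its raw order and is never copied into an expert-major buffer.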