HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs

Yongji Wu, Xueshen Liu, Shuowei Jin, Ceyu Xu, Feng Qian, Z. Morley Mao, Matthew Lentz, Danyang Zhuo, Ion Stoica

TL;DR

HeterMoE trains MoE models on clusters of heterogeneous GPUs by disaggregating attention and expert computation and using zebra parallelism to overlap the two workloads. It also introduces Asymmetric Expert Assignment (Asym-EA) with a gather-and-squeeze optimizer to achieve fine-grained load balancing and reduce idle bubbles. HeterMoE delivers up to 2.3x speed-up over existing MoE training systems and maintains roughly 95% of training throughput on average when half of the GPUs in a homogeneous A40 cluster are replaced with V100s. This enables cost-effective, scalable MoE training on mixed hardware, and the approach can be integrated with existing optimization strategies to further improve utilization and efficiency.
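The load-balancing idea behind an asymmetric expert split can be illustrated with a toy model. The sketch below is not the paper's Asym-EA or its gather-and-squeeze optimizer; it is a brute-force search under stated assumptions: two GPU groups run fully overlapped, the newer group runs attention plus `k` experts, the older group runs the remaining experts, and the per-layer time is set by the slower group. The function name `balance_experts` and all timing parameters are hypothetical.

```python
def balance_experts(num_experts, attn_time, expert_time_new, expert_time_old):
    """Toy model: pick how many experts stay on the newer GPU group.

    Assumes (hypothetically) that both groups work in parallel, so the
    per-layer time is the maximum of the two groups' busy times.
    Returns (k, bottleneck_time), where k experts run on the newer group.
    """
    best = None
    for k in range(num_experts + 1):
        # Newer group: attention plus k experts.
        new_side = attn_time + k * expert_time_new
        # Older group: the remaining experts only.
        old_side = (num_experts - k) * expert_time_old
        bottleneck = max(new_side, old_side)
        # Keep the split with the smallest bottleneck (first one on ties).
        if best is None or bottleneck < best[1]:
            best = (k, bottleneck)
    return best


if __name__ == "__main__":
    # Made-up timings in arbitrary units, for illustration only.
    k, t = balance_experts(num_experts=8, attn_time=4.0,
                           expert_time_new=1.0, expert_time_old=1.5)
    print(f"experts on newer GPUs: {k}, bottleneck time: {t}")
```

With these made-up timings, placing every expert on the older group leaves the newer group idle for most of the layer, while moving a few experts over shortens the bottleneck. A real system would also have to account for communication and per-token routing imbalance, which this sketch ignores.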

Abstract

The Mixture-of-Experts (MoE) architecture has become increasingly popular as a method to scale up large language models (LLMs). To save costs, heterogeneity-aware training solutions have been proposed to utilize GPU clusters made up of both newer and older-generation GPUs. However, existing solutions are agnostic to the performance characteristics of different MoE model components (i.e., attention and expert) and do not fully utilize each GPU's compute capability. In this paper, we introduce HeterMoE, a system to efficiently train MoE models on heterogeneous GPUs. Our key insight is that newer GPUs significantly outperform older generations on attention due to architectural advancements, while older GPUs are still relatively efficient for experts. HeterMoE disaggregates attention and expert computation, and older GPUs are assigned only expert modules. Through the proposed zebra parallelism, HeterMoE overlaps the computation on different GPUs, and it employs an asymmetric expert assignment strategy for fine-grained load balancing to minimize GPU idle time. Our evaluation shows that HeterMoE achieves up to 2.3x speed-up compared to existing MoE training systems, and 1.4x compared to an optimally balanced heterogeneity-aware solution. HeterMoE efficiently utilizes older GPUs by maintaining 95% training throughput on average, even with half of the GPUs in a homogeneous A40 cluster replaced with V100s.
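To see why overlapping attention and expert computation on disaggregated GPUs helps, consider a simplified forward-pass timeline. This is a minimal sketch under assumptions that are not taken from the paper: the batch is split into micro-batches, one GPU group runs only attention and the other only experts, each group processes tasks layer by layer in micro-batch order, and communication cost is ignored. The function names and timings are hypothetical.

```python
def overlapped_makespan(layers, micro_batches, attn_t, expert_t):
    """Forward-pass finish time when the two GPU groups work concurrently.

    Dependencies modeled: attention of a micro-batch at layer l waits for
    its expert output from layer l-1; experts at layer l wait for the
    attention output of the same micro-batch at layer l.
    """
    attn_free = 0.0      # time at which the attention group is next idle
    expert_free = 0.0    # time at which the expert group is next idle
    expert_done = [0.0] * micro_batches  # per micro-batch, previous layer
    for _ in range(layers):
        attn_done = []
        for m in range(micro_batches):
            start = max(attn_free, expert_done[m])
            attn_free = start + attn_t
            attn_done.append(attn_free)
        for m in range(micro_batches):
            start = max(expert_free, attn_done[m])
            expert_free = start + expert_t
            expert_done[m] = expert_free
    return max(expert_done) if expert_done else 0.0


def serial_makespan(layers, micro_batches, attn_t, expert_t):
    """Finish time if every task runs one after another with no overlap."""
    return layers * micro_batches * (attn_t + expert_t)


if __name__ == "__main__":
    args = dict(layers=2, micro_batches=2, attn_t=1.0, expert_t=1.0)
    print("overlapped:", overlapped_makespan(**args))
    print("serial:    ", serial_makespan(**args))
```

In this toy setting, while the expert group processes one micro-batch, the attention group can already work on the other, so the idle time on each group shrinks. When attention and expert times are unequal, the faster group still waits, which is the kind of bubble a finer-grained load-balancing step is meant to reduce.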

Paper Structure

This paper contains 20 sections, 1 theorem, 5 equations, 12 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

For a MoE model trained using HeterMoE's zebra parallelism, where all experts are placed on expert GPUs and the execution of tasks follows data dependency and stream sequential execution constraints, the following execution schedule on each GPU minimizes the total time of a training iteration: for [...] for computation on expert GPUs: [...] Following the above ordering of compute tasks, the dispatch and co[...]

Figures (12)

  • Figure 1: MoE architecture and expert parallelism.
  • Figure 2: Speed-up of newer generation GPUs over older ones on attention and expert modules from Mixtral 8x7B (Jiang et al., 2024).
  • Figure 3: Major components of HeterMoE. Zebra parallelism overlaps the execution of attention and expert GPUs, while we introduce Asym-EA to minimize bubbles.
  • Figure 4: Zebra parallelism replaces expert parallelism in heterogeneous settings. It overlaps compute on attention and expert GPUs, as well as compute and communication within each GPU.
  • Figure 5: Demonstration of the optimal execution schedule of zebra parallelism, compared to swapping any of the two tasks. We show the forward attention computation and combine all-to-all of the second and third layers.
  • ...and 7 more figures

Theorems & Definitions (2)

  • Theorem 1
  • proof