Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling

Jialong Li, Shreyansh Tripathi, Lakshay Rastogi, Yiming Lei, Rui Pan, Yiting Xia

TL;DR

Aurora minimizes MoE inference time by jointly optimizing model deployment and all-to-all communication scheduling. The authors describe it as the first approach to do so across several GPU cluster scenarios, and report speedups of up to 2.38x in homogeneous clusters and 3.54x in heterogeneous ones.

Abstract

As machine learning models scale in size and complexity, their computational requirements become a significant barrier. Mixture-of-Experts (MoE) models alleviate this issue by selectively activating relevant experts. Despite this, MoE models are hindered by high communication overhead from all-to-all operations, low GPU utilization due to the synchronous communication constraint, and complications from heterogeneous GPU environments. This paper presents Aurora, which optimizes both model deployment and all-to-all communication scheduling to address these challenges in MoE inference. Aurora achieves minimal communication times by strategically ordering token transmissions in all-to-all communications. It improves GPU utilization by colocating experts from different models on the same device, avoiding the limitations of synchronous all-to-all communication. We analyze Aurora's optimization strategies theoretically across four common GPU cluster settings: exclusive vs. colocated models on GPUs, and homogeneous vs. heterogeneous GPUs. Aurora provides optimal solutions for three cases, and for the remaining NP-hard scenario, it offers a polynomial-time sub-optimal solution with only a 1.07x degradation from the optimal. Aurora is the first approach to minimize MoE inference time via optimal model deployment and communication scheduling across various scenarios. Evaluations demonstrate that Aurora significantly accelerates inference, achieving speedups of up to 2.38x in homogeneous clusters and 3.54x in heterogeneous environments. Moreover, Aurora enhances GPU utilization by up to 1.5x compared to existing methods.

Paper Structure

This paper contains 26 sections, 6 theorems, 9 equations, 23 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

In the Exclusive + Homogeneous scenario, minimizing inference time is equivalent to minimizing communication time.
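
One informal way to read this statement, offered as a sketch and not as the paper's proof: assume the inference time splits additively into a computation term and a communication term, and that on identical GPUs with one model per GPU the computation term does not depend on the chosen deployment and schedule. The symbols $T_{\text{inf}}$, $T_{\text{comp}}$, $T_{\text{comm}}$ and the decision variable $x$ are hypothetical names introduced here for illustration.

```latex
% Assumption (not taken from the paper): additive decomposition,
% with T_comp independent of the deployment/scheduling decision x.
\begin{align}
  T_{\text{inf}}(x) &= T_{\text{comp}} + T_{\text{comm}}(x), \\
  \arg\min_{x} T_{\text{inf}}(x)
    &= \arg\min_{x} \bigl( T_{\text{comp}} + T_{\text{comm}}(x) \bigr)
     = \arg\min_{x} T_{\text{comm}}(x).
\end{align}
```

Under these assumptions, adding a constant does not change the minimizer, so any deployment and schedule that minimizes communication time also minimizes inference time.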

Figures (23)

  • Figure 1: MoE model structure.
  • Figure 1: Input parameters.
  • Figure 2: Aurora aims to minimize inference time across four different scenarios. It optimizes expert colocation, GPU assignment, and communication scheduling for each case. Aurora achieves optimal results in the first three scenarios and delivers suboptimal performance in the final one due to its NP-hardness.
  • Figure 3: (a) Colocating experts from the same model results in wasted GPU resources and increased inference time, as follow-up computations are delayed by synchronous all-to-all communications. (b) Colocating experts from different models enables full interleaving of computation and communication, resolving this issue.
  • Figure 4: (a) A big switch model representing the non-blocking inter-GPU network fabric. (b) Originally, the all-to-all communication of tokens from GPU 1 (red) and GPU 2 (yellow) to all other GPUs takes 3 units of time overall. (c) Optimizing the token order reduces transmission time to 2 units.
  • ...and 18 more figures
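
The effect described in the Figure 4 caption can be reproduced with a toy simulation. The following is a minimal sketch under assumptions that are not taken from the paper: tokens are unit-sized, time is slotted, each GPU sends at most one token and receives at most one token per slot (the big switch itself never blocks), senders transmit strictly in their given order, and ties for a receiver go to the lower sender id. The function name `all_to_all_time` and the three-GPU traffic pattern are hypothetical.

```python
from collections import deque


def all_to_all_time(send_orders):
    """Count the time slots needed to drain all send queues.

    send_orders maps a sender id to the ordered list of destination ids,
    one entry per unit-sized token. Assumed model: one send and one
    receive per GPU per slot, strict per-sender ordering.
    """
    queues = {sender: deque(dests) for sender, dests in send_orders.items()}
    slots = 0
    while any(queues.values()):
        busy_receivers = set()
        for sender in sorted(queues):  # lower sender id wins a conflict
            queue = queues[sender]
            # Send the head-of-line token only if its receiver is still free.
            if queue and queue[0] not in busy_receivers:
                busy_receivers.add(queue.popleft())
        slots += 1
    return slots


# GPU 1 and GPU 2 each send one token to every other GPU (3 GPUs total).
naive = {1: [3, 2], 2: [3, 1]}      # both target GPU 3 first and collide
reordered = {1: [3, 2], 2: [1, 3]}  # first-slot destinations are disjoint

print(all_to_all_time(naive))      # 3
print(all_to_all_time(reordered))  # 2
```

In the naive order, GPU 2 is blocked in the first slot because GPU 3 is already receiving from GPU 1, which stretches the transfer to three slots. Reordering GPU 2's tokens removes the conflict and finishes in two slots, matching the 3-to-2 reduction in the caption. This only illustrates why ordering matters; it does not implement Aurora's scheduling algorithm.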

Theorems & Definitions (6)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6