eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference

Suraiya Tairin, Shohaib Mahmud, Haiying Shen, Anand Iyer

TL;DR

eMoE tackles the memory bottleneck of Mixture-of-Experts (MoE) inference by predicting and loading only the experts most likely to be needed and by scheduling with task-aware policies. It introduces a transformer-based expert predictor, invokes that predictor periodically rather than per prompt to limit overhead, and skips prediction for tasks that are less sensitive to routing accuracy, preserving accuracy while reducing memory and latency. Experimental results show up to 80% lower memory consumption and up to 17% lower inference latency, along with prompts 40x longer, batches 4.5x larger, and 1.5x higher throughput. The approach makes MoE inference more practical for large language models by combining recurrence-aware expert loading, SLO-conscious scheduling, and task-dependent routing tolerance.

Abstract

In recent years, Mixture-of-Experts (MoE) has emerged as an effective approach for enhancing the capacity of deep neural networks (DNNs) with sub-linear computational costs. However, storing all experts on GPUs incurs significant memory overhead, increasing the monetary cost of MoE-based inference. To address this, we propose eMoE, a memory-efficient inference system for MoE-based large language models (LLMs) that builds on observations from our experimental measurements. eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing. Because we found that using the same experts for subsequent prompts has minimal impact on perplexity, eMoE invokes the expert predictor every few prompts rather than for each prompt, which reduces loading latency while maintaining accuracy. In addition, it skips predictions for tasks less sensitive to routing accuracy. Finally, it uses task-aware scheduling to minimize inference latency by considering Service Level Objectives (SLOs), task-specific output lengths, and expert loading latencies. Experimental results show that, compared to existing systems, eMoE reduces memory consumption by up to 80% while maintaining accuracy and reduces inference latency by up to 17%. It also enables processing 40x longer prompts and 4.5x larger batches, and achieves 1.5x higher throughput.
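
The periodic predictor invocation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `PeriodicExpertSelector`, the `period` parameter, the `insensitive_tasks` set, and the toy predictor are all hypothetical names introduced here, and the real system's predictor, invocation period, and task classification are not specified in this text. The sketch only shows the control flow: predict an expert set, reuse it for the next few prompts, and skip prediction for tasks assumed to tolerate routing inaccuracy.

```python
from typing import Callable, Optional, Set


class PeriodicExpertSelector:
    """Hypothetical helper: reuse one predicted expert set for `period` prompts."""

    def __init__(
        self,
        predict: Callable[[str], Set[int]],
        period: int,
        insensitive_tasks: Optional[Set[str]] = None,
    ) -> None:
        if period < 1:
            raise ValueError("period must be at least 1")
        self.predict = predict            # stand-in for an expert predictor
        self.period = period              # assumed number of prompts per prediction
        self.insensitive_tasks = insensitive_tasks or set()
        self.cached: Optional[Set[int]] = None
        self.since_prediction = 0         # prompts served by the cached set
        self.predictor_calls = 0          # how often the predictor actually ran

    def experts_for(self, prompt: str, task: str) -> Set[int]:
        # Assumption: tasks tolerant of routing inaccuracy keep using the
        # experts that are already loaded, so no prediction is made for them.
        if self.cached is not None and task in self.insensitive_tasks:
            return self.cached

        # Run the predictor on the first prompt and then once every `period`
        # prompts; in between, the previously predicted set is reused.
        if self.cached is None or self.since_prediction >= self.period:
            self.cached = self.predict(prompt)
            self.predictor_calls += 1
            self.since_prediction = 0

        self.since_prediction += 1
        return self.cached


def toy_predictor(prompt: str) -> Set[int]:
    """Toy stand-in that maps a prompt to two of eight expert ids."""
    base = len(prompt) % 8
    return {base, (base + 1) % 8}
```

With `period=3`, seven consecutive prompts from a routing-sensitive task would trigger the predictor on the first, fourth, and seventh prompts only; the expert sets returned in between are the cached ones, so no new expert loading is needed for those prompts.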

Paper Structure

This paper contains 28 sections, 3 equations, 20 figures, 3 tables, 1 algorithm.

Figures (20)

  • Figure 1: Inference time with memory consumption for dynamic expert loading during inference.
  • Figure 2: Inference time with memory consumption for different approaches.
  • Figure 3: Expert activation transition patterns between consecutive layers.
  • Figure 4: CDF of the number of generated tokens across different tasks.
  • Figure 5: Accuracy when progressively applying exact token-to-expert routing in MoE layers closer to the output layer.
  • ...and 15 more figures