SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai "Helen" Li, Yiran Chen

TL;DR

This work tackles the memory and bandwidth challenges of serving large mixture-of-experts models by exploiting activation sparsity with a data-aware approach. It introduces SiDA-MoE, a two-thread serving framework aided by an offline-trained hash function that predicts activated experts for each token, enabling dynamic offloading of inactive experts to main memory and reducing GPU memory usage. A lightweight predictor based on an LSTM with sparse attention is trained via truncated knowledge distillation to approximate router outputs, achieving up to 3.93x throughput improvement and up to 80% GPU memory savings with less than 1% accuracy loss on large Switch Transformer variants. These results demonstrate that data-aware, sparsity-exploiting inference can significantly improve scalability and efficiency for large MoE models under constrained hardware budgets.

Abstract

Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet realizing this benefit often results in inefficient GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. To address this, we introduce SiDA-MoE ($\textbf{S}$parsity-$\textbf{i}$nspired $\textbf{D}$ata-$\textbf{A}$ware), an efficient inference approach tailored for large MoE models. SiDA-MoE judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity of expert activation in MoE models. By adopting a data-aware perspective, SiDA-MoE achieves enhanced model efficiency with a negligible performance drop. Specifically, SiDA-MoE speeds up MoE inference with up to a $3.93\times$ throughput increase, up to a $72\%$ latency reduction, and up to $80\%$ GPU memory savings, with a performance drop as low as $1\%$. This work paves the way for scalable and efficient deployment of large MoE models, even with constrained resources. Code is available at: https://github.com/timlee0212/SiDA-MoE.

Paper Structure (25 sections, 6 equations, 11 figures, 5 tables, 1 algorithm)

This paper contains 25 sections, 6 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Diagram Showcasing the Architecture of MoE-based Transformers. Within each MoE layer only a limited number of experts are activated for inference.
  • Figure 2: Memory Efficiency of Switch Transformers on SST2. The $x$-axis represents sentence length, and the bars record the number of sentences of each length. The line represents the effective memory utilization of Switch Transformers on SST2 across sentence lengths. Utilization as low as $5\%$ can be observed for large models.
  • Figure 3: MoE Overhead on SST2. The bars depict the percentage breakdown between MoE overhead and ideal inference time. Up to $72\%$ of the time on Switch-base-256 is occupied by MoE overhead, including expert selection, expert invocation, and communication. Notably, the share of expert selection overhead grows as model size increases.
  • Figure 4: Expert Activation in Switch Transformers on SST2. The $x$-axis denotes sentence length, with bars illustrating the number of sentences of each length. The line depicts the ratio of idle experts. Notably, Switch-base-256 and Switch-base-128 activate fewer than $20\%$ and $40\%$ of their experts, respectively.
  • Figure 5: Overview of SiDA-MoE. SiDA-MoE contains two threads, an inference thread and a hash-building thread, that run concurrently. As each batch ${\mathbb{X}}_j$ arrives, the hash-building thread constructs the expert hash table ${\mathbb{H}}_j$ and queues it. In tandem, the inference thread processes the preceding batch ${\mathbb{X}}_i$, dynamically managing experts in MoE layers based on the hash table ${\mathbb{H}}_{i}$.
  • ...and 6 more figures
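
The two-thread arrangement described for Figure 5 can be illustrated with a small producer/consumer sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the names `toy_predictor`, `hash_building_thread`, `inference_thread`, and `serve` are hypothetical, the predictor is a deterministic stub standing in for the offline-trained hash function, and "GPU residency" is modeled as a plain Python set rather than real device memory.

```python
import queue
import threading

NUM_EXPERTS = 8  # hypothetical expert count for this toy example


def toy_predictor(token):
    """Stub for the offline-trained hash function (assumption: the real
    predictor is a learned model; here we map characters to an expert id)."""
    return sum(map(ord, token)) % NUM_EXPERTS


def hash_building_thread(batches, table_queue):
    """Build an expert table H_j for each batch X_j and queue it."""
    for j, batch in enumerate(batches):
        table = [toy_predictor(token) for token in batch]
        table_queue.put((j, batch, table))
    table_queue.put(None)  # sentinel: no more batches


def inference_thread(table_queue, results):
    """Consume queued tables and keep only the predicted experts 'on GPU'."""
    on_gpu = set()
    while True:
        item = table_queue.get()
        if item is None:
            break
        i, batch, table = item
        needed = set(table)
        offloaded = on_gpu - needed  # experts moved back to main memory
        on_gpu = needed              # experts resident for this batch
        # A real system would run the MoE layers here; we only record state.
        results.append((i, sorted(on_gpu), len(offloaded)))


def serve(batches):
    """Run both threads concurrently and return the per-batch records."""
    table_queue = queue.Queue(maxsize=2)  # bounded so the builder stays close
    results = []
    builder = threading.Thread(target=hash_building_thread,
                               args=(batches, table_queue))
    worker = threading.Thread(target=inference_thread,
                              args=(table_queue, results))
    builder.start()
    worker.start()
    builder.join()
    worker.join()
    return results


if __name__ == "__main__":
    for record in serve([["a", "b"], ["c"]]):
        print(record)
```

Because the table for a later batch is built while an earlier batch is being processed, expert prediction stays off the inference thread's critical path. The bounded queue is a design choice of this sketch that keeps the builder from running arbitrarily far ahead.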

Theorems & Definitions (1)

  • Remark 1