SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget

Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, Yunxin Liu

TL;DR

This work tackles the challenge of deploying MoE-based large language models on memory-constrained devices by introducing SwapMoE, a framework that keeps a compact, dynamically updated set of Virtual Experts in main memory and maps them to the full expert pool. It combines importance-aware selection, masked gating, amortized/asynchronous expert updates, and profiling-guided memory planning with a genetic-search planner to meet memory budgets while preserving accuracy and reducing latency. Offline profiling and hardware-aware performance models guide the configuration across layers, delivering tunable memory-accuracy-latency trade-offs demonstrated on edge devices with Switch Transformer and GPTSAN models. On text summarization with Switch Transformer, for example, memory consumption drops from $14.2$ GiB to $4.7$ GiB with a 50% latency reduction and a ROUGE-2 drop of 0.041, which makes on-device serving of large MoE models practical at a modest accuracy cost.

Abstract

Mixture of experts (MoE) is a popular technique for improving the capacity of Large Language Models (LLMs) with conditionally activated parallel experts. However, serving MoE models on memory-constrained devices is challenging because of their large parameter size. Typical solutions such as memory swapping or expert pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient serving of MoE-based large language models with tunable memory budgets. The main idea of SwapMoE is to keep a small dynamic set of important experts, namely Virtual Experts, in main memory for inference, while seamlessly maintaining how the Virtual Experts map to the actual experts. Experiments show that SwapMoE can reduce the memory footprint while maintaining reasonable accuracy. For example, on text summarization tasks with Switch Transformer, SwapMoE reduces memory consumption from 14.2 GiB to 4.7 GiB, together with a 50% latency reduction and a slight ROUGE-2 score drop of 0.041.

Paper Structure (24 sections, 3 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: SwitchT-16: Naive on-demand expert loading reduces memory, but also results in huge inference overhead.
  • Figure 2: SwitchT-16: Latency breakdown of MoE model inference with layer-wise memory swapping. The transmission of model weights consumes the majority of the time.
  • Figure 3: Weight loading may block computation when running MoE with layer-wise memory swapping. Due to the large size of MoE layers and sparse computation, loading the weights of a layer is always slower than computing the layer, which slows down the inference even if the weights are loaded asynchronously.
  • Figure 4: The workflow of SwapMoE. Given an off-the-shelf MoE model and a memory-constrained consumer device, we satisfy the constraint by executing the model with a smaller set of experts (Virtual Experts). The Virtual Experts are selected, used, and updated seamlessly at runtime, and the memory allocation for the experts is determined offline.
  • Figure 5: (a) Original gating of MoE: all experts may be used for inference; (b) Our masked gating: only Virtual Experts will be used.
  • ...and 6 more figures
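
The masked gating idea in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `masked_top1_gate`, the top-1 routing choice (as in Switch Transformer), and the renormalization of the softmax over the resident experts are all assumptions made for illustration.

```python
import math


def masked_top1_gate(logits, resident):
    """Route a token to the best-scoring expert among those held in memory.

    logits:   router scores, one per expert (hypothetical input).
    resident: set of expert indices currently in memory, standing in for
              the Virtual Experts (hypothetical representation).
    Returns (expert_index, gate_weight).
    """
    if not any(i in resident for i in range(len(logits))):
        raise ValueError("at least one resident expert is required")

    # Mask: experts that are not in memory get a score of -inf,
    # so they can never be selected.
    masked = [s if i in resident else -math.inf for i, s in enumerate(logits)]

    # Numerically stable softmax over the masked scores (assumption:
    # the gate weight is renormalized over the resident experts only).
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-1 routing: pick the resident expert with the highest probability.
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


if __name__ == "__main__":
    # Expert 0 has the highest raw score but is not resident,
    # so the token is routed to a resident expert instead.
    print(masked_top1_gate([2.0, 0.5, 1.0, -1.0], {1, 2}))
```

In this example, unmasked gating would pick expert 0; with experts 1 and 2 resident, the token goes to expert 2, so no weights need to be loaded on the critical path.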