SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, Yunxin Liu
TL;DR
This work tackles the challenge of deploying MoE-based large language models on memory-constrained devices by introducing SwapMoE, a framework that keeps a compact, dynamically updated set of Virtual Experts in main memory and maps them to the full expert pool. It combines importance-aware selection, masked gating, amortized/asynchronous expert updates, and profiling-guided memory planning with a genetic-search planner to meet memory budgets while preserving accuracy and reducing latency. Offline profiling and hardware-aware performance models guide configuration across layers, giving tunable memory-accuracy-latency trade-offs that are demonstrated on edge devices with Switch Transformer and GPTSAN models. For example, on text summarization with Switch Transformer, memory consumption falls from $14.2$ GiB to $4.7$ GiB with a 50% latency reduction and a Rouge-2 drop of 0.041, which makes on-device serving of large MoE models practical at a modest accuracy cost.
Abstract
Mixture of experts (MoE) is a popular technique for improving the capacity of Large Language Models (LLMs) with conditionally activated parallel experts. However, serving MoE models on memory-constrained devices is challenging because of their large parameter size. Typical solutions such as memory swapping or expert pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient serving of MoE-based large language models with tunable memory budgets. The main idea of SwapMoE is to keep a small dynamic set of important experts, namely Virtual Experts, in main memory for inference, while seamlessly maintaining how the Virtual Experts map to the actual experts. Experiments show that SwapMoE can reduce the memory footprint while maintaining reasonable accuracy. For example, on text summarization tasks with Switch Transformer, SwapMoE reduces memory consumption from 14.2 GiB to 4.7 GiB, together with a 50% latency reduction and a slight Rouge-2 score drop of 0.041.
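The core idea of routing only among experts that are currently resident in memory can be illustrated with a minimal sketch. This is not the SwapMoE implementation: the function names (`masked_top1`, `select_resident`, `update_importance`) are hypothetical, the importance score is assumed here to be an exponential moving average of softmax gate probabilities, and the actual transfer of expert weights between storage and main memory is omitted.

```python
import math


def masked_top1(gate_logits, resident):
    """Return the index of the best-scoring expert among the resident ones."""
    if not resident:
        raise ValueError("at least one expert must be resident")
    # Non-resident experts are masked out with -inf so they can never win.
    masked = [
        score if idx in resident else -math.inf
        for idx, score in enumerate(gate_logits)
    ]
    return max(range(len(masked)), key=masked.__getitem__)


def select_resident(importance, budget):
    """Pick the `budget` experts with the highest importance scores."""
    ranked = sorted(range(len(importance)), key=importance.__getitem__, reverse=True)
    return set(ranked[:budget])


def update_importance(importance, gate_logits, decay=0.9):
    """Blend softmax gate probabilities into a running importance estimate.

    Assumption: an exponential moving average stands in for whatever
    importance metric the real system uses.
    """
    peak = max(gate_logits)
    exps = [math.exp(s - peak) for s in gate_logits]
    total = sum(exps)
    return [decay * old + (1.0 - decay) * e / total for old, e in zip(importance, exps)]


# Four experts, with a memory budget that fits only two of them.
importance = [0.25, 0.25, 0.25, 0.25]
resident = {0, 3}
logits = [0.1, 2.0, 0.5, 1.5]

# Expert 1 has the highest logit but is not resident, so expert 3 is used.
chosen = masked_top1(logits, resident)

# Later, the resident set is refreshed from the updated importance scores.
importance = update_importance(importance, logits)
resident = select_resident(importance, budget=2)
```

In this toy run the token is served by expert 3 without waiting for expert 1 to be loaded, and the subsequent refresh brings expert 1 into the resident set. Decoupling the routing decision from the refresh is what allows updates to be amortized or performed asynchronously, at the cost of occasionally routing to a second-best expert.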
