SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget

Rui Kong; Yuanchun Li; Qingtian Feng; Weijun Wang; Xiaozhou Ye; Ye Ouyang; Linghe Kong; Yunxin Liu

TL;DR

This work tackles the challenge of deploying MoE-based large language models on memory-constrained devices by introducing SwapMoE, a framework that keeps a compact, dynamically updated set of Virtual Experts in main memory and maps them to the full expert pool. It combines importance-aware selection, masked gating, amortized and asynchronous expert updates, and profiling-guided memory planning with a genetic-search planner to meet a memory budget while preserving accuracy and reducing latency. Offline profiling and hardware-aware performance models guide the per-layer configuration, giving tunable memory-accuracy-latency trade-offs that are demonstrated on edge devices with Switch Transformer and GPTSAN models. For example, on text summarization with Switch Transformer, memory consumption drops from $14.2$ GiB to $4.7$ GiB with a 50% latency reduction and a Rouge-2 drop of 0.041, which makes on-device serving of large MoE models practical at a modest accuracy cost.

Abstract

Mixture of experts (MoE) is a popular technique for increasing the capacity of Large Language Models (LLMs) with conditionally activated parallel experts. However, serving MoE models on memory-constrained devices is challenging because of their large parameter size. Typical solutions such as memory swapping or expert pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient serving of MoE-based large language models with tunable memory budgets. The main idea of SwapMoE is to keep a small dynamic set of important experts, namely Virtual Experts, in main memory for inference, while seamlessly maintaining how the Virtual Experts map to the actual experts. Experiments show that SwapMoE can reduce the memory footprint while maintaining reasonable accuracy. For example, on text summarization tasks with Switch Transformer, SwapMoE reduces memory consumption from 14.2 GiB to 4.7 GiB, together with a 50% latency reduction and a slight Rouge-2 score drop of 0.041.
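
As a rough illustration of the masked-gating idea, where only the Virtual Experts held in memory can be selected, the following minimal Python sketch restricts a top-1 router to resident experts. The function name, the `resident` set, and the choice to renormalize the softmax over resident experts are assumptions made for illustration; this is not the paper's implementation.

```python
import math


def masked_top1_gate(logits, resident):
    """Pick the best-scoring expert among those currently in memory.

    logits   -- router scores, one per expert (hypothetical input)
    resident -- set of expert indices held in main memory,
                standing in for the Virtual Experts (assumed representation)
    """
    if not any(0 <= i < len(logits) for i in resident):
        raise ValueError("at least one valid expert must be resident")

    # Mask non-resident experts by replacing their score with -inf.
    masked = [s if i in resident else -math.inf for i, s in enumerate(logits)]

    # Numerically stable softmax over the masked scores.
    # math.exp(-inf) is 0.0, so masked experts receive zero weight.
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-1 routing: send the token to the highest-probability resident expert.
    chosen = max(range(len(probs)), key=probs.__getitem__)
    return chosen, probs[chosen]


if __name__ == "__main__":
    # Expert 0 has the highest raw score but is not resident,
    # so the token is routed to the best resident expert instead.
    expert, weight = masked_top1_gate([2.0, 0.5, 1.0, -1.0], resident={1, 2})
    print(expert, round(weight, 4))
```

In this sketch a token whose preferred expert is not in memory falls back to the best resident expert rather than triggering a blocking load, which is the trade-off between accuracy and latency that the abstract describes.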

Paper Structure (24 sections, 3 equations, 11 figures, 2 tables)

Introduction
Background and Motivation
Limitations of Conventional Solutions
Activation Locality in MoE Models
Our Design: SwapMoE
Importance-aware Virtual Experts Selection & Inference
Seamless Virtual Experts Update
Fine-grained Expert Profiling
Profiling-guided Memory Planning
Expert I/O Frequency
Layer Space Allocation
Evaluation
Experimental Setup
Overall Runtime Performance
Offline Planning Performance
...and 9 more sections

Figures (11)

Figure 1: SwitchT-16: Naive on-demand expert loading reduces memory, but also results in huge inference overhead.
Figure 2: SwitchT-16: Latency breakdown of MoE model inference with layer-wise memory swapping. The transmission of model weights consumes the majority of the time.
Figure 3: Weight loading may block computation when running MoE with layer-wise memory swapping. Due to the large size of MoE layers and sparse computation, loading the weights of a layer is always slower than computing the layer, which slows down the inference even if the weights are loaded asynchronously.
Figure 4: The workflow of SwapMoE. Given an off-the-shelf MoE model and a memory-constrained consumer device, we satisfy the constraint by executing the model with a smaller set of experts (Virtual Experts). The Virtual Experts are selected, used, and updated seamlessly at runtime, and the memory allocation for the experts is determined offline.
Figure 5: (a) Original gating of MoE: all experts may be used for inference; (b) Our Masked Gating: only Virtual Experts will be used.
...and 6 more figures
