D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving

Haodong Wang, Qihua Zhou, Zicong Hong, Song Guo

TL;DR

D$^2$MoE tackles the memory and I/O bottlenecks of on-device MoE-based LLM serving by combining dual routing and dynamic scheduling with matryoshka weight quantization (MWQ). It dynamically allocates bit-widths per token, nests weights across bit-widths to avoid storing multiple versions of each expert, and uses a bit-width-aware I/O-compute pipeline with the hottest-expert-bit-first (HEBF) scheduling strategy to minimize idle bubbles between I/O and computation. Offline preprocessing learns token-adaptive bit-widths and calibrates MWQ, while the online engine manages cross-request execution under a memory budget, improving throughput by up to 1.39× and reducing peak memory by up to 53% with accuracy close to INT8 baselines. This algorithm-system co-design makes MoE inference on edge devices practical for contemporary LLMs.

Abstract

The mixture of experts (MoE) model is a sparse variant of large language models (LLMs), designed to strike a better balance between model capability and computational overhead. Despite its benefits, MoE is still too expensive to deploy on resource-constrained edge devices, especially given the demands of on-device inference services. Recent research efforts often apply model compression techniques, such as quantization, pruning and merging, to reduce MoE complexity. Unfortunately, because their model optimization strategies are predefined and static, they cannot always achieve the desired quality-overhead trade-off when handling multiple requests, ultimately degrading the on-device quality of service. These limitations motivate us to propose D$^2$MoE, an algorithm-system co-design framework that matches diverse task requirements by dynamically allocating the most suitable bit-width to each expert. Specifically, inspired by the nested structure of matryoshka dolls, we propose matryoshka weight quantization (MWQ) to progressively compress expert weights in a bit-nested manner and reduce the required runtime memory. On top of this, we further optimize the I/O-computation pipeline and design a heuristic scheduling algorithm following our hottest-expert-bit-first (HEBF) principle, which maximizes the expert parallelism between the I/O and computation queues under constrained memory budgets, thus significantly reducing the idle bubbles spent waiting for experts to load. Evaluations on real edge devices show that D$^2$MoE improves the overall inference throughput by up to 1.39$\times$ and reduces the peak memory footprint by up to 53% over the latest on-device inference frameworks, while still preserving serving accuracy comparable to its INT8 counterparts.
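
To make the idea of bit-nested weights concrete, the sketch below shows one simple way a single stored code can serve several bit-widths: keep an unsigned 8-bit code per weight and read a lower-precision version by taking its most significant bits. This is an illustrative assumption, not the MWQ procedure itself, which the abstract does not specify. The names `quantize`, `nested_view`, `dequantize`, and `FULL_BITS` are hypothetical.

```python
from typing import List

FULL_BITS = 8  # assumed widest stored precision (hypothetical choice)


def quantize(weights: List[float], lo: float, hi: float) -> List[int]:
    """Uniformly quantize floats in [lo, hi] to unsigned FULL_BITS-bit codes."""
    levels = (1 << FULL_BITS) - 1
    step = (hi - lo) / levels
    return [min(levels, max(0, round((w - lo) / step))) for w in weights]


def nested_view(codes: List[int], bits: int) -> List[int]:
    """Read a lower-bit code by keeping only the `bits` most significant bits.

    No second copy of the weights is stored: every narrower code is a
    prefix of the wider one, like nested dolls.
    """
    if not 1 <= bits <= FULL_BITS:
        raise ValueError("bits must be between 1 and FULL_BITS")
    return [c >> (FULL_BITS - bits) for c in codes]


def dequantize(view: List[int], bits: int, lo: float, hi: float) -> List[float]:
    """Map `bits`-wide codes back to floats using each bucket's midpoint."""
    shift = FULL_BITS - bits
    step = (hi - lo) / ((1 << FULL_BITS) - 1)
    # A low-bit code covers 2**shift consecutive full-precision codes.
    half_bucket = ((1 << shift) - 1) / 2
    return [lo + ((c << shift) + half_bucket) * step for c in view]


weights = [-1.0, -0.25, 0.1, 0.9]
codes = quantize(weights, lo=-1.0, hi=1.0)
for bits in (8, 4, 2):
    approx = dequantize(nested_view(codes, bits), bits, lo=-1.0, hi=1.0)
    worst = max(abs(a - w) for a, w in zip(approx, weights))
    print(f"INT{bits}: worst-case error {worst:.4f}")
```

In this sketch, serving a token at a lower bit-width only requires loading the leading bits of each code, which is the property that lets memory and I/O shrink with the chosen precision. A real scheme would also need per-bit-width calibration of scales, which is omitted here.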

Paper Structure

This paper contains 25 sections, 7 equations, 14 figures, 4 tables, 2 algorithms.

Figures (14)

  • Figure 1: Traditional MoE single routing (expert ID only) vs. our D$^2$MoE dual routing (ID and bit-width).
  • Figure 2: Accuracy loss of expert quantization to INT1 across 10 samples from the Hellaswag dataset.
  • Figure 3: Comparison of expert I/O, computation, and inference latency with different request numbers.
  • Figure 4: The architecture overview of D$^2$MoE.
  • Figure 5: Comparison between fixed and dynamic bit-width allocation.
  • ...and 9 more figures