Expert Streaming: Accelerating Low-Batch MoE Inference via Multi-chiplet Architecture and Dynamic Expert Trajectory Scheduling

Songchen Ma, Hongyi Li, Weihao Zhang, Yonghao Tan, Pingcheng Dong, Yu Liu, Lan Liu, Yuzhong Jiao, Xuejiao Liu, Luhong Liang, Kwang-Ting Cheng

Abstract

Mixture-of-Experts (MoE) is a promising approach for edge AI with low-batch inference. Yet, on-device deployments often face limited on-chip memory and severe workload imbalance; the prevalent use of offloading further incurs off-chip memory access bottlenecks. Moreover, MoE sparsity and dynamic gating shift distributed strategies toward much finer granularity and introduce runtime scheduling considerations. Recently, high-bandwidth die-to-die (D2D) chiplet interconnects have created new opportunities for multi-chiplet systems to address workload imbalance and offloading bottlenecks with fine-grained scheduling. In this paper, we propose Fully Sharded Expert Data Parallelism (FSE-DP), a parallelization paradigm specifically architected for low-batch MoE inference on multi-chiplet accelerators. FSE-DP attains adaptive computation-communication overlap and balanced load by orchestrating fine-grained, complementary expert streams along dynamic trajectories across high-bandwidth D2D links. The attendant dataflow complexity is tamed by a minimal, hardware-amenable set of virtualization rules and a lightweight scheduling algorithm. Our approach achieves a 1.22× to 2.00× speedup over state-of-the-art baselines and saves up to 78.8% of on-chip memory.

Paper Structure

This paper contains 20 sections, 2 equations, 18 figures, 1 table, and 2 algorithms.

Figures (18)

  • Figure 1: Typical template of (a) a multi-chiplet-based AI accelerator and (b) a Mixture-of-Experts network.
  • Figure 2: (a) Shapes for different models. (b, c) Long-tail effect of MoE models under different batch sizes. The figure shows the number of tokens processed in a specific layer for DeepSeek-MoE-16B [rajbhandari2022deepspeed] on the Wikitext-2 dataset [merity2016pointer] and Qwen3-30B-A3B [yang2025qwen3] on WinoGrande [sakaguchi2021winogrande]. Experts on the x-axis are sorted by the number of tokens they process; the y-axis gives the token count per expert. The long-tail effect is more pronounced when the total number of tokens is small. R denotes different requests.
  • Figure 3: Fully sharded expert–data parallelism (example: four chiplets compute expert 1). The figure illustrates how a single expert is computed within one MCM: expert 1 is evenly sharded into slices across chiplets (E1-S1 through E1-S4). Black tokens denote the tokens that activate expert 1, while gray tokens are other buffered tokens on that chiplet that do not activate expert 1. Before computing expert 1, tokens are redispatched across chiplets so that each chiplet holds a similar number of black tokens, balancing the load (a minimal code sketch of this redispatch step follows the figure list). R denotes a request, and Seq denotes the activated token sequence stored on a chiplet. During expert computation, chiplets can exchange data in two equivalent ways to cover all black tokens: (a) keep token sequences fixed and circulate expert-1 slices so each slice visits the chiplets holding black tokens; or (b) keep expert-1 slices fixed and circulate black-token sequences so they visit all chiplets containing slices of expert 1.
  • Figure 4: Micro-slice flow for overlapping D2D communication and computation (example: chiplet 1 while computing expert 1). Each expert slice (E1-S1 through E1-S4) is further partitioned into micro-slices (M1 through M4). (a) Baseline micro-slice overlap. In each step, chiplet 1 computes its local sequence using the current micro-slice, while concurrently receiving the next micro-slice from a neighbor chiplet and sending the just-computed micro-slice to the next chiplet; arrows indicate the D2D transfers that overlap with the compute stage. The weight buffer shows the micro-slice storage on chiplet 1 over time, where each row is one micro-slice-sized buffer slot and each column is a time step. A colored cell indicates that the slot stores a micro-slice from a specific expert slice (see the color legend), and the numeral in the cell is the micro-slice index within that slice. Blank cells are free slots, and bold numerals mark the micro-slice being computed in that step. (b) Eager micro-slice usage. Chiplet 1 immediately forwards the micro-slice under computation and, in the next step, computes the most recently received micro-slice, so each micro-slice quickly traverses all chiplets and can be discarded earlier, reducing average weight-buffer occupancy (a minimal schedule sketch follows the figure list).
  • Figure 5: Demonstration of the paired-load policy and the token buffering policy.
  • ...and 13 more figures
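
The load-balancing redispatch described for Figure 3 can be pictured with a short sketch. The following is a minimal, illustrative Python reconstruction under simple assumptions (a greedy even split of activating tokens); the names `Chiplet`, `redispatch_tokens`, and `activates` are hypothetical and not the paper's API.

```python
# Minimal sketch of the pre-compute token redispatch illustrated in Figure 3.
# Assumptions: each chiplet buffers a set of tokens, only some of which activate the
# expert about to be computed; before that expert runs, the activating tokens are
# rebalanced so every chiplet processes roughly the same number.

from dataclasses import dataclass, field

@dataclass
class Chiplet:
    buffered: list                               # all tokens resident on this chiplet
    active: list = field(default_factory=list)   # tokens routed here for the current expert

def redispatch_tokens(chiplets, expert_id, activates):
    """Evenly spread the tokens that activate `expert_id` across `chiplets`."""
    pool = [tok for c in chiplets for tok in c.buffered if activates(tok, expert_id)]
    per_chip = -(-len(pool) // len(chiplets))    # ceiling division
    i = 0
    for c in chiplets:
        c.active = pool[i:i + per_chip]          # on hardware this slice moves over D2D links
        i += per_chip
    return chiplets

# Toy example: four chiplets, one of which buffers far more expert-1 tokens than the rest.
chips = [Chiplet(buffered=[("R0", t) for t in range(8)]),
         Chiplet(buffered=[("R1", t) for t in range(2)]),
         Chiplet(buffered=[("R2", t) for t in range(1)]),
         Chiplet(buffered=[("R3", t) for t in range(1)])]
redispatch_tokens(chips, expert_id=1, activates=lambda tok, e: True)
print([len(c.active) for c in chips])            # -> [3, 3, 3, 3]
```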
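The eager micro-slice usage of Figure 4(b) can likewise be sketched as a ring schedule. The snippet below is an illustrative reconstruction under stated assumptions (unidirectional ring, one micro-slice computed per chiplet per step, just-received micro-slices computed first), not the paper's algorithm verbatim; `eager_step` is a hypothetical name.

```python
# Minimal sketch of the eager micro-slice schedule in Figure 4(b).
# Assumptions: C chiplets form a unidirectional ring; expert-1 slice s lives on chiplet s
# and is split into M micro-slices; in each step a chiplet computes one micro-slice against
# its local token sequence while the D2D transfer of that micro-slice to the next chiplet
# overlaps with the compute. Under the eager policy, the just-received micro-slice is
# computed next, so each micro-slice finishes its ring traversal within C consecutive
# steps and its weight-buffer slot can be freed.

C, M = 4, 4                        # chiplets in the ring / micro-slices per expert slice

def eager_step(chiplet, step):
    """(origin_slice, micro_index) computed by `chiplet` at `step`, both 0-based."""
    origin_slice = (chiplet - step) % C          # whose slice this micro-slice came from
    micro_index = step // C                      # which micro-slice of that slice
    return origin_slice, micro_index

for step in range(C * M):
    work = [eager_step(c, step) for c in range(C)]
    row = "  ".join(f"chip{c + 1}:S{s + 1}-M{m + 1}" for c, (s, m) in enumerate(work))
    print(f"step {step + 1:2d}: {row}")
    if (step + 1) % C == 0:
        # Every chiplet has now computed this micro-slice index of every slice,
        # so those C micro-slices can be dropped from the weight buffers.
        print(f"          -> M{step // C + 1} of every slice can be discarded")
```

Because a micro-slice is forwarded in the same step it is computed, its lifetime in the weight buffers is bounded by the ring length rather than by the whole expert computation, which is the source of the reduced average buffer occupancy described in the caption.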