Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting

Zhongkai Yu, Yue Guan, Zihao Yu, Chenyang Zhou, Zhengding Hu, Shuyi Pei, Yangwook Kang, Yufei Ding, Po-An Tsai

TL;DR

The paper treats data movement as the primary bottleneck in large-scale MoE LLM serving. It profiles data movement across four MoE models released in 2025 (200B–1000B parameters) and distills six insights, grouped into temporal and spatial patterns, that inform both software strategies and hardware design. A wafer-scale GPU case study then applies two lightweight architectural enhancements: expert-placement-aware task distribution, and a data-driven predictor with local HBM caching. Together they deliver average speedups of 5.3x on DeepSeek V3 and 3.1x on Qwen3 and reduce inter-die data movement. The authors also release their profiling traces and a simulator to support further research on MoE serving across future architectures.

Abstract

Large-scale Mixture of Experts (MoE) Large Language Models (LLMs) have recently become the frontier open-weight models, achieving capability comparable to proprietary ones. However, their random expert selection mechanism introduces significant data movement overhead that becomes the dominant bottleneck in multi-unit LLM serving systems. To understand the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across four state-of-the-art large-scale MoE models released in 2025 (200B-1000B) using over 24,000 requests spanning diverse workloads. We perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse future serving systems. Using these insights, we then demonstrate how to improve wafer-scale GPUs as a case study, and show that minor architectural modifications achieve substantial performance gains, delivering 5.3x and 3.1x average speedups on DeepSeek V3 and Qwen3, respectively. Our work presents the first comprehensive data-centric analysis of large-scale MoE models and a concrete design study using the learned lessons, with profiling traces and simulation framework already open-sourced with more than 1k downloads. Our traces and results are publicly available at https://huggingface.co/datasets/core12345/MoE_expert_selection_trace

Paper Structure

This paper contains 35 sections, 14 figures, 1 table, 1 algorithm.

Figures (14)

  • Figure 1: MoE LLM model sizes and release dates. Bubble size indicates the number of experts in each layer. Prior studies [zhu2025megascale-infer, tairin2025emoe, skliar2024mixture, chitty2025lexi] provide limited analysis of smaller models from narrow perspectives, while our work presents the first comprehensive analysis of multiple unstudied SOTA models.
  • Figure 2: Latency breakdown for different data movement in DeepSeek V3, modeled after various serving configurations, including [ds_sglang_16h20, sglang-deepseek-blogpost, liu2024deepseek].
  • Figure 3: Inference process of MoE LLMs and the categorization method for our proposed data-centric profiling approach.
  • Figure 4: Layer-level temporal correlation heatmaps for (a) DeepSeek and (b) Qwen, together with (c) statistical results across all layers, reveal a strong correlation between expert selection in layer $N$ and that in layer $N+1$.
  • Figure 5: Token-level temporal correlation heatmaps for (a) DeepSeek, (b) Llama, and (c) Qwen, together with (d) statistical results across all layers, demonstrate a strong correlation between the expert selection for consecutive tokens $t$ and $t+1$.
  • ...and 9 more figures
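
The layer-level correlation described for Figure 4 can be illustrated with a small frequency-counting sketch. This is not the paper's predictor or the schema of the published traces: the trace layout (`trace[token][layer]` as a list of selected expert ids), the function names, and the toy data are all assumptions made for illustration. The idea is only that if experts in layer $N+1$ tend to co-occur with particular experts in layer $N$, a co-occurrence table built from past tokens gives a cheap guess at which experts to fetch next.

```python
from collections import Counter, defaultdict


def transition_counts(trace, layer):
    """Count co-selections between `layer` and `layer + 1`.

    `trace` uses a hypothetical layout: trace[token][layer] is the list of
    expert ids selected for that token in that layer.
    """
    counts = defaultdict(Counter)
    for token_selection in trace:
        for i in token_selection[layer]:
            for j in token_selection[layer + 1]:
                counts[i][j] += 1
    return counts


def predict_next_layer(counts, current_experts, k):
    """Return the k next-layer experts most often co-selected with the
    experts currently active, by summing their co-occurrence counts."""
    scores = Counter()
    for i in current_experts:
        scores.update(counts.get(i, Counter()))
    return [expert for expert, _ in scores.most_common(k)]


# Toy trace: 4 tokens, 2 layers, 2 experts selected per layer (made-up data).
toy_trace = [
    [[0, 1], [2, 3]],
    [[0, 1], [2, 3]],
    [[0, 4], [2, 5]],
    [[4, 5], [5, 6]],
]

counts = transition_counts(toy_trace, layer=0)
prefetch = predict_next_layer(counts, current_experts=[0, 1], k=2)
print(prefetch)
```

In a serving system, a list like `prefetch` would be a hint for which expert weights to stage close to the compute unit before the next layer runs. How such hints interact with caching and placement is a design question this sketch does not address.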