Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting
Zhongkai Yu, Yue Guan, Zihao Yu, Chenyang Zhou, Zhengding Hu, Shuyi Pei, Yangwook Kang, Yufei Ding, Po-An Tsai
TL;DR
The paper tackles data movement as a primary bottleneck in large-scale MoE LLM serving. It profiles four MoE models released in 2025 (200B–1000B parameters) from a data-movement-centric perspective and extracts six actionable insights, grouped into temporal and spatial patterns, that inform both software strategies and hardware design. A wafer-scale GPU case study then shows that two lightweight architectural enhancements (expert-placement-aware task distribution and a data-driven predictor with local HBM caching) yield average speedups of 5.3x on DeepSeek V3 and 3.1x on Qwen3, along with large reductions in inter-die data movement. The work also releases its profiling traces and a simulator to enable broader research, pointing toward scalable, efficient MoE serving on future heterogeneous architectures.
Abstract
Large-scale Mixture of Experts (MoE) Large Language Models (LLMs) have recently become the frontier open-weight models, achieving capability comparable to proprietary ones. However, their random expert selection mechanism introduces significant data movement overhead that becomes the dominant bottleneck in multi-unit LLM serving systems. To understand the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across four state-of-the-art large-scale MoE models released in 2025 (200B-1000B) using over 24,000 requests spanning diverse workloads. We perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse future serving systems. Using these insights, we then show how to improve wafer-scale GPUs as a case study: minor architectural modifications that leverage the insights achieve substantial performance gains, delivering 5.3x and 3.1x average speedups on DeepSeek V3 and Qwen3, respectively. Our work presents the first comprehensive data-centric analysis of large-scale MoE models and a concrete design study using the learned lessons, with the profiling traces and simulation framework already open-sourced and downloaded more than 1k times. Our traces and results are publicly available at https://huggingface.co/datasets/core12345/MoE_expert_selection_trace
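To make the link between expert selection and data movement concrete, the sketch below replays a sequence of selected expert IDs through a small local cache and counts how many selections would need a remote fetch. It is only an illustration under stated assumptions: the function name `simulate_expert_cache`, the flat list-of-IDs trace format, and the LRU eviction policy are all hypothetical, and none of them is taken from the paper's predictor, simulator, or published trace schema.

```python
from collections import OrderedDict


def simulate_expert_cache(trace, cache_size):
    """Replay expert IDs through an LRU cache and return (hits, misses).

    Assumption: `trace` is a flat iterable of expert IDs in selection order,
    and each miss stands in for one remote fetch of expert weights.
    """
    cache = OrderedDict()  # keys are expert IDs, ordered oldest to newest
    hits = 0
    misses = 0
    for expert_id in trace:
        if expert_id in cache:
            hits += 1
            cache.move_to_end(expert_id)  # mark as most recently used
        else:
            misses += 1
            cache[expert_id] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    return hits, misses


if __name__ == "__main__":
    toy_trace = [3, 3, 7, 3, 7, 1, 3, 7]  # made-up expert IDs, not real data
    for size in (2, 3):
        hits, misses = simulate_expert_cache(toy_trace, size)
        print(f"cache_size={size}: hits={hits}, misses={misses}")
```

On this toy trace, growing the cache from two to three entries turns most misses into hits, which is the kind of temporal locality a caching scheme would rely on. Real traces would need to be parsed into this flat form first.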
