Serving Large Language Models on Huawei CloudMatrix384

Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, Shirui Lu, Zhao Qiu, Peiyang Li, Xianyu Chang, Zhengzhong Yu, Fangzheng Miao, Jia Zheng, Ying Li, Yuan Feng, Bei Wang, Zaijian Zong, Mosong Zhou, Wenli Zhou, Houjiang Chen, Xingyu Liao, Yipeng Li, Wenxiao Zhang, Ping Zhu, Yinggang Wang, Chuanjie Xiao, Depeng Liang, Dong Cao, Juncheng Liu, Yongqiang Yang, Xiaolong Bai, Yi Li, Huaguo Xie, Huatao Wu, Zhibin Yu, Lv Chen, Hu Liu, Yujun Ding, Haipei Zhu, Jing Xia, Yi Xiong, Zhou Yu, Heng Liao

TL;DR

CloudMatrix384 is a production-grade, fully peer-to-peer AI datacenter supernode designed to scale large language model serving. The serving stack, CloudMatrix-Infer, disaggregates prefill, decode, and caching into independently scalable pools. It relies on the high-bandwidth Unified Bus (UB) to support large-scale expert parallelism (LEP) at degree EP320, along with a disaggregated memory pool for KV caches and model blocks. Hardware-aware optimizations (fused MoE/MLA operators, microbatch pipelines, and INT8 quantization) deliver state-of-the-art efficiency for DeepSeek-R1 inference: 6,688 tokens/s per NPU in prefill and 1,943 tokens/s per NPU in decode at a time-per-output-token (TPOT) below 50 ms, while maintaining accuracy. These results show the practical value of tightly integrated hardware/software co-design for scalable, latency-aware LLM serving. The paper also outlines future directions toward larger, more flexible supernodes and finer-grained disaggregation.

Abstract

The rapid evolution of large language models (LLMs), driven by growing parameter scales, adoption of mixture-of-experts (MoE) architectures, and expanding context lengths, imposes unprecedented demands on AI infrastructure. Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency, compounded by variable workloads and strict service-level objectives. Addressing these issues requires fundamentally redesigned hardware-software integration. This paper introduces Huawei CloudMatrix, a next-generation AI datacenter architecture, realized in the production-grade CloudMatrix384 supernode. It integrates 384 Ascend 910 NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth Unified Bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources. These features optimize performance for communication-intensive operations, such as large-scale MoE expert parallelism and distributed key-value cache access. To fully leverage CloudMatrix384, we propose CloudMatrix-Infer, an advanced LLM serving solution incorporating three core innovations: a peer-to-peer serving architecture that independently scales prefill, decode, and caching; a large-scale expert parallelism strategy supporting EP320 via efficient UB-based token dispatch; and hardware-aware optimizations including specialized operators, microbatch-based pipelining, and INT8 quantization. Evaluation with the DeepSeek-R1 model shows CloudMatrix-Infer achieves state-of-the-art efficiency: prefill throughput of 6,688 tokens/s per NPU and decode throughput of 1,943 tokens/s per NPU (<50 ms TPOT). It effectively balances throughput and latency, sustaining 538 tokens/s per NPU even under stringent 15 ms latency constraints, while INT8 quantization maintains model accuracy across benchmarks.
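The decode figures in the abstract pair a per-NPU throughput with a TPOT bound, and the two can be related by a simple Little's-law estimate: if every active sequence emits one token per TPOT interval, then the number of sequences in flight on an NPU is roughly throughput × TPOT. The sketch below applies that estimate to the reported numbers. It is only an illustration under stated assumptions: the function name `implied_concurrency` is hypothetical, the TPOT bounds (50 ms and 15 ms) are treated as if they were the achieved values, so the results are upper bounds, and any mechanism that emits more than one token per decoding step is ignored.

```python
def implied_concurrency(tokens_per_s_per_npu: float, tpot_ms: float) -> float:
    """Little's-law estimate of sequences decoded concurrently on one NPU.

    Assumes each active sequence produces exactly one token every `tpot_ms`
    milliseconds (an assumption of this sketch, not a claim from the paper).
    """
    return tokens_per_s_per_npu * (tpot_ms / 1000.0)


# Throughput and TPOT bounds as quoted in the abstract.
reported = {
    "decode, TPOT < 50 ms": (1943.0, 50.0),
    "decode, TPOT < 15 ms": (538.0, 15.0),
}

for label, (throughput, tpot_ms) in reported.items():
    estimate = implied_concurrency(throughput, tpot_ms)
    print(f"{label}: at most ~{estimate:.0f} sequences in flight per NPU")
```

Under these assumptions the estimate comes out near 97 concurrent sequences per NPU for the 50 ms case and near 8 for the 15 ms case, which illustrates the throughput/latency trade-off the abstract describes: a tighter TPOT target leaves room for far fewer concurrent sequences per NPU.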

Paper Structure

This paper contains 54 sections, 4 equations, 24 figures, 10 tables.

Figures (24)

  • Figure 1: Huawei’s CloudMatrix architecture vision reimagines AI datacenter infrastructure from the ground up. By dismantling traditional siloed designs, it enables full peer-to-peer disaggregation and pooling of CPUs, NPUs, memory, NICs, and other resources over a unified, ultra-high-performance network, forming the foundation for scalable, AI-native datacenters.
  • Figure 2: Peer-to-peer hardware architecture of a CloudMatrix384 supernode, featuring an ultra-high-bandwidth Unified Bus (UB) plane for intra-supernode scaling, an RDMA plane for inter-supernode communication, and a Virtual Private Cloud (VPC) plane for integration with the datacenter network. All reported network bandwidth values denote unidirectional bandwidth.
  • Figure 3: Logical overview of the Huawei Ascend 910 chip, highlighting its dual-die architecture. All reported network bandwidth values denote unidirectional bandwidth.
  • Figure 4: Logical overview of an Ascend 910 node within the CloudMatrix384. All reported network bandwidth values denote unidirectional bandwidth.
  • Figure 5: The UB switch system in the CloudMatrix384. All reported network bandwidth values denote unidirectional bandwidth.
  • ...and 19 more figures