WaferLLM: Large Language Model Inference at Wafer Scale

Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, Luo Mai

TL;DR

The paper addresses the memory bandwidth bottleneck of LLM inference on GPUs by moving inference onto wafer-scale accelerators, guided by the PLMR model of their hardware characteristics. It introduces WaferLLM, a complete wafer-scale LLM inference system, together with MeshGEMM and MeshGEMV, which are designed specifically for mesh NoC architectures and provide efficient GEMM for prefill and GEMV for decode. The reported results are 100-200x faster than state-of-the-art massive-core systems, GEMV 606x faster than on a single A100, and 10-20x end-to-end speedups over A100 GPU clusters, with GEMV also 16x more energy-efficient than on the A100. The work provides a foundational framework for wafer-scale LLM inference and releases open-source tooling to encourage adoption and development on next-generation wafer-scale accelerators.

Abstract

Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh architecture with large distributed on-chip memory (tens of GB in total) and ultra-high on-chip memory bandwidth (tens of PB/s). However, current LLM inference systems, optimized for shared memory architectures like GPUs, fail to exploit these accelerators fully. We introduce WaferLLM, the first wafer-scale LLM inference system. WaferLLM is guided by a novel PLMR model (pronounced as "Plummer") that captures the unique hardware characteristics of wafer-scale architectures. Leveraging this model, WaferLLM pioneers wafer-scale LLM parallelism, optimizing the utilization of hundreds of thousands of on-chip cores. It also introduces MeshGEMM and MeshGEMV, the first GEMM and GEMV implementations designed to scale effectively on wafer-scale accelerators. Evaluations show that WaferLLM achieves up to 200$\times$ higher accelerator utilization than state-of-the-art methods. Leveraging a wafer-scale accelerator (Cerebras WSE2), WaferLLM delivers GEMV operations 606$\times$ faster and 16$\times$ more energy-efficient than on an NVIDIA A100 GPU. For full LLM inference, WaferLLM achieves 10-20$\times$ speedups over A100 GPU clusters running SGLang and vLLM. These advantages are expected to grow as wafer-scale AI models, software, and hardware continue to mature. WaferLLM is open-sourced at https://github.com/MeshInfra/WaferLLM.

Paper Structure

This paper contains 31 sections, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: Key components in LLM inference
  • Figure 2: Massive-scale mesh-based memory architecture
  • Figure 3: Prefill parallelism plan. $E_xF_y$ represents a matrix of shape $E \times F$, where the $E$ dimension is partitioned along the $x$-axis of cores, and $F$ along the $y$-axis of cores on a mesh.
  • Figure 4: Decode parallelism plan. $E^yF_x$ indicates the $E$ dimension is replicated along the $y$-axis, and $F$ is partitioned along the $x$-axis.
  • Figure 5: KV cache concatenation vs. KV cache shift
  • ...and 5 more figures
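
The partition notation in the captions of Figures 3 and 4 can be made concrete with a small sketch. The Python below is illustrative only and is not taken from WaferLLM: the helper names (`split_ranges`, `prefill_tiles`, `decode_tiles`) are hypothetical, contiguous near-equal blocks are an assumption, and "replicated along the $y$-axis" is read here as every core in a mesh column holding the full $E$ extent. It maps each core coordinate $(x, y)$ to the row and column ranges of the $E \times F$ matrix that the core would own.

```python
def split_ranges(n, parts):
    """Split range(n) into `parts` contiguous, near-equal (start, stop) ranges."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        # The first `extra` parts take one additional element each.
        stop = start + base + (1 if p < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def prefill_tiles(E, F, mesh_x, mesh_y):
    """E_xF_y: E is partitioned along the x-axis, F along the y-axis.

    Returns {(x, y): ((row_start, row_stop), (col_start, col_stop))}.
    """
    rows = split_ranges(E, mesh_x)
    cols = split_ranges(F, mesh_y)
    return {(x, y): (rows[x], cols[y])
            for x in range(mesh_x) for y in range(mesh_y)}


def decode_tiles(E, F, mesh_x, mesh_y):
    """E^yF_x: E is replicated along the y-axis, F is partitioned along the x-axis.

    Assumed reading: every core holds the full E extent, and cores that
    share an x coordinate hold identical copies of the same F slice.
    """
    cols = split_ranges(F, mesh_x)
    return {(x, y): ((0, E), cols[x])
            for x in range(mesh_x) for y in range(mesh_y)}
```

For example, `prefill_tiles(8, 6, 4, 3)` assigns core `(1, 2)` rows 2 to 4 and columns 4 to 6 (half-open ranges), whereas `decode_tiles(8, 6, 3, 2)` gives cores `(1, 0)` and `(1, 1)` the same tile, since the $y$-axis only replicates.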