Distributed Inference Performance Optimization for LLMs on CPUs

Pujiang He, Shan Zhou, Changqing Li, Wenhuan Huang, Weifei Yu, Duyi Wang, Chen Meng, Sheng Gui

TL;DR

The paper tackles the challenge of running large language models on resource-limited hardware with restricted memory by proposing a distributed inference optimization solution for CPUs, implemented with the oneAPI Collective Communications Library (oneCCL). It focuses on three concrete techniques: minimizing synchronization, enforcing a one-time synchronization per decoder layer, and eliminating memory copies through zero-copy data transfers. Evaluations on Qwen-72B show a latency of 140 ms per output token on a cluster of 4× Intel Xeon Scalable 8575C CPUs, faster than the average human reading speed of about 200 ms per token. The work highlights practical gains for CPU-based LLM deployment in resource-limited environments and points to broader hardware support and open-source contributions as future directions.
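The idea of deferring communication to a single synchronization point can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it does not use oneCCL, it covers only a two-layer MLP block rather than a full decoder layer, and every name in it (`shard`, `all_reduce_sum`, `mlp_tensor_parallel`) is hypothetical. Plain Python lists stand in for tensors, and a sum over simulated ranks stands in for an all-reduce collective.

```python
# Simulated tensor-parallel MLP: each "rank" holds a shard of the weights and
# the ranks synchronize once, at the end, by summing their partial outputs.
# Assumption: this is an illustration only, not the paper's oneCCL-based code.

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def mlp_reference(w1, w2, x):
    """Unsharded two-layer MLP: w2 @ relu(w1 @ x)."""
    return matvec(w2, relu(matvec(w1, x)))


def shard(w1, w2, num_ranks):
    """Split w1 by rows (hidden units) and w2 by the matching columns."""
    hidden = len(w1)
    assert hidden % num_ranks == 0, "hidden size must divide evenly"
    step = hidden // num_ranks
    shards = []
    for rank in range(num_ranks):
        lo, hi = rank * step, (rank + 1) * step
        shards.append((w1[lo:hi], [row[lo:hi] for row in w2]))
    return shards


def all_reduce_sum(partials):
    """Stand-in for a collective all-reduce: element-wise sum across ranks."""
    return [sum(values) for values in zip(*partials)]


def mlp_tensor_parallel(w1, w2, x, num_ranks):
    """Run the MLP on simulated ranks with one synchronization at the end."""
    partials = []
    for w1_part, w2_part in shard(w1, w2, num_ranks):
        # Local work only: no communication between the two matrix products.
        hidden_part = relu(matvec(w1_part, x))
        partials.append(matvec(w2_part, hidden_part))
    # The single synchronization point for the whole block.
    return all_reduce_sum(partials)


w1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 1.0], [2.0, 0.0]]  # hidden=4, in=2
w2 = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 1.0, 0.5]]       # out=2, hidden=4
x = [1.0, 2.0]

expected = mlp_reference(w1, w2, x)
actual = mlp_tensor_parallel(w1, w2, x, num_ranks=4)
print(expected, actual)
```

Because ReLU is applied element-wise, splitting the first weight matrix by hidden unit lets each rank finish both matrix products locally, so the sharded result matches the unsharded one after a single reduction.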

Abstract

Large language models (LLMs) hold tremendous potential for addressing numerous real-world challenges, yet they typically demand significant computational resources and memory. Deploying LLMs on a resource-limited hardware device with restricted memory capacity presents considerable challenges. Distributed computing has emerged as a prevalent strategy to mitigate single-node memory constraints and speed up LLM inference. To ease these hardware limitations, we propose an efficient distributed inference optimization solution for LLMs on CPUs. We conduct experiments with the proposed solution on 5th Gen Intel Xeon Scalable Processors, and the results show that the time per output token for a 72B-parameter LLM is 140 ms, much faster than the average human reading speed of about 200 ms per token.
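The latency comparison in the abstract can be restated as throughput. The short sketch below uses only the two figures quoted above (140 ms and 200 ms per token); the function and variable names are hypothetical and chosen for illustration.

```python
# Convert the per-token latencies quoted in the abstract into tokens per second.
MS_PER_SECOND = 1000.0


def tokens_per_second(ms_per_token):
    """Throughput implied by a given time per output token."""
    return MS_PER_SECOND / ms_per_token


reported_rate = tokens_per_second(140.0)  # reported time per output token
reading_rate = tokens_per_second(200.0)   # average human reading speed

# How many times faster generation is than reading, per the quoted figures.
speed_ratio = reported_rate / reading_rate

print(f"{reported_rate:.2f} tokens/s generated vs {reading_rate:.2f} tokens/s read")
print(f"ratio: {speed_ratio:.2f}x")
```

On these figures, generation runs at roughly 7.1 tokens per second against a reading rate of 5 tokens per second, a margin of about 1.4×.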

Paper Structure (7 sections, 3 figures)

Figures (3)

  • Figure 1: Distributed inference based on oneCCL.
  • Figure 2: One time synchronization.
  • Figure 3: Minimize memory copy.