A dynamic parallel method for performance optimization on hybrid CPUs

Luo Yu, Liu Yucheng, Shen Haihao

TL;DR

The paper addresses inefficient LLM inference on hybrid CPUs, whose cores have imbalanced capabilities, by introducing a dynamic parallel method that balances the workload before parallel execution starts. The system has two components: a CPU runtime that maintains a per-core performance ratio $pr_i$, and a thread scheduler that assigns core $i$ a share $s_i = \frac{pr_i}{\sum_j pr_j} \cdot s$ of the total work $s$. The ratios are updated from the measured per-core execution times $t_i$ using ${pr_i}' = \frac{pr_i}{\sum_j (t_i pr_j / t_j)}$, followed by smoothing with a factor $\alpha$: ${pr_i} = \alpha \cdot pr_i + (1-\alpha) \cdot {pr_i}'$. Empirically, the approach reaches over 90% memory bandwidth utilization on two hybrid CPUs, up to 3.7x speedup versus llama.cpp, and lower prefill (20–30%) and decode (9–22%) latencies during 4-bit LLM inference, with decode throughput around 16 tokens/s. These findings show practical impact for client devices using AIPC-style hybrid CPUs and motivate future work on coordinated compute dispatch across CPU, GPU, and NPU units to further lower latency and boost throughput.
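The two formulas above can be illustrated with a minimal Python sketch. This is not the Neural Speed implementation: the function names `partition` and `update_ratios`, the default `alpha`, the rounding policy (leftover work units go to the fastest core), and the convention that ratios start out summing to 1 are all assumptions made for illustration.

```python
from typing import List


def partition(total: int, ratios: List[float]) -> List[int]:
    """Split `total` work units so that s_i = pr_i / sum_j(pr_j) * s.

    Assumption: shares are floored to integers and the leftover units
    are handed to the core with the largest ratio.
    """
    denom = sum(ratios)
    sizes = [int(total * r / denom) for r in ratios]
    sizes[ratios.index(max(ratios))] += total - sum(sizes)
    return sizes


def update_ratios(ratios: List[float], times: List[float],
                  alpha: float = 0.5) -> List[float]:
    """Refresh per-core ratios from measured execution times.

    pr_i' = pr_i / sum_j(t_i * pr_j / t_j), then
    pr_i  = alpha * pr_i + (1 - alpha) * pr_i'.
    The value of alpha here is illustrative only.
    """
    updated = []
    for pr_i, t_i in zip(ratios, times):
        denom = sum(t_i * pr_j / t_j for pr_j, t_j in zip(ratios, times))
        pr_new = pr_i / denom
        updated.append(alpha * pr_i + (1 - alpha) * pr_new)
    return updated


# Two cores start with equal ratios; core 0 finishes its half in half the time.
ratios = update_ratios([0.5, 0.5], times=[1.0, 2.0], alpha=0.5)
shares = partition(1024, ratios)  # core 0 now receives the larger share
```

Because ${pr_i}'$ always sums to 1 over the cores, the smoothed ratios stay normalized as long as the initial ratios are. A core that finishes faster than the others sees its ratio rise gradually, with $\alpha$ damping the effect of noisy timings.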

Abstract

The AIPC concept is gaining popularity, and more and more hybrid CPUs will be running AI models on client devices. However, the current AI inference framework overlooks the imbalanced hardware capability of hybrid CPUs, leading to low inference performance. To address this issue, we have introduced a dynamic parallel method for hybrid CPUs, which significantly increases LLM inference performance by balancing the workload for each core of a hybrid CPU before the parallel work starts. This method has enabled Neural Speed to achieve more than 90% (on average) of memory bandwidth on two hybrid Intel CPUs.

Paper Structure

This paper contains 14 sections, 3 equations, 4 figures.

Figures (4)

  • Figure 1: The dynamic LLM inference process
  • Figure 2: The latency and bandwidth of GEMM in different parallel methods
  • Figure 3: The latency of the prefill phase and the decode phase in Neural Speed (OpenMP and our method) and llama.cpp
  • Figure 4: The performance ratio of one P-core in the prefill phase and the decode phase