A dynamic parallel method for performance optimization on hybrid CPUs
Luo Yu, Liu Yucheng, Shen Haihao
TL;DR
The paper addresses inefficient LLM inference on hybrid CPUs, whose cores differ in capability, by introducing a dynamic parallel method that balances the workload across cores before parallel execution starts. The system has two components: a CPU runtime that maintains a per-core performance ratio $pr_i$, and a thread scheduler that gives core $i$ the share $s_i = \frac{pr_i}{\sum_j pr_j} \cdot s$ of the total work $s$. The ratios are updated from the per-core execution times $t_i$ as ${pr_i}' = \frac{pr_i}{\sum_j (t_i pr_j / t_j)}$ and then smoothed with ${pr_i} = \alpha \cdot pr_i + (1-\alpha) \cdot {pr_i}'$. Empirically, the approach reaches over 90% memory bandwidth utilization on two hybrid CPUs, delivers up to 3.7x speedup versus llama.cpp, and reduces prefill latency by 20–30% and decode latency by 9–22% during 4-bit LLM inference, with decode throughput around 16 tokens/s. These results show practical value for AIPC client devices built on hybrid CPUs and motivate future work on coordinated compute dispatch across CPU, GPU, and NPU units to further lower latency and raise throughput.
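The partition and update rules above can be sketched in a few lines of Python. This is only an illustration of the two formulas, not the Neural Speed implementation: the function names, the default $\alpha = 0.5$, the handling of the integer remainder, and the assumption that the ratios are kept normalized to sum to 1 are all choices made here for the example.

```python
from typing import List


def partition(total: int, ratios: List[float]) -> List[int]:
    """Split `total` work items using s_i = pr_i / sum_j(pr_j) * s."""
    ratio_sum = sum(ratios)
    sizes = [int(total * r / ratio_sum) for r in ratios]
    # Assumption: items lost to integer truncation go to the fastest core.
    sizes[ratios.index(max(ratios))] += total - sum(sizes)
    return sizes


def update_ratios(ratios: List[float], times: List[float],
                  alpha: float = 0.5) -> List[float]:
    """Apply pr_i' = pr_i / sum_j(t_i * pr_j / t_j), then smooth with alpha.

    `times[i]` is the measured execution time of core i for its last share.
    `alpha` is a hypothetical default; the text above does not give a value.
    """
    updated = []
    for pr_i, t_i in zip(ratios, times):
        denom = sum(t_i * pr_j / t_j for pr_j, t_j in zip(ratios, times))
        pr_new = pr_i / denom
        updated.append(alpha * pr_i + (1 - alpha) * pr_new)
    return updated


if __name__ == "__main__":
    ratios = [0.5, 0.5]          # start by treating both cores as equal
    times = [1.0, 2.0]           # core 1 took twice as long as core 0
    ratios = update_ratios(ratios, times)
    print(ratios, partition(100, ratios))
```

Note that ${pr_i}'$ equals $(pr_i / t_i) / \sum_j (pr_j / t_j)$, so the updated ratios always sum to 1, and a core that finishes its share faster than the others receives a larger share next time. When all cores finish at the same time, the ratios are left unchanged.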
Abstract
The AIPC concept is gaining popularity, and a growing number of hybrid CPUs will run AI models on client devices. However, current AI inference frameworks overlook the imbalanced hardware capabilities of hybrid CPUs, which leads to low inference performance. To address this issue, we introduce a dynamic parallel method for hybrid CPUs that significantly increases LLM inference performance by balancing the workload across the cores of a hybrid CPU before the parallel work starts. With this method, Neural Speed achieves more than 90% (on average) of memory bandwidth on two hybrid Intel CPUs.
