Pushing the Envelope of LLM Inference on AI-PC and Intel GPUs
Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke
TL;DR
The paper addresses the inefficiency of current runtimes for ultra-low-bit LLMs on resource-constrained hardware. It designs 1-bit and 2-bit GEMM microkernels for CPUs with an up-convert-to-int8 path and a novel 2-bit weight layout, then extends them to Intel Xe2 GPUs with a mixed-precision GEMM design that fuses quantization. The kernels are integrated into PyTorch-TPP and vLLM, enabling end-to-end inference with up to $7\times$ speedup over 16-bit baselines on CPUs, up to $2.2\times$ speedup over the prior runtime bitnet.cpp, and up to $6.3\times$ lower end-to-end latency than BF16 on Xe2 GPUs. The work provides both practical performance results and roofline-based models that explain the observed speeds, demonstrating that ultra-low-bit inference can approach high-end GPU performance on AI-PCs and discrete client GPUs. The findings advance efficient deployment of ultra-low-bit LLMs in edge and AI-PC contexts and guide future work toward ARM/SVE implementations and broader platform support.
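To make the up-convert-to-int8 idea concrete, the following is a minimal pure-Python sketch, not the paper's kernel. It assumes ternary weights in {-1, 0, 1} stored as 2-bit codes (code = weight + 1), four per byte with the first weight in the lowest bits. This is a deliberately simple layout and is not the novel 2-bit layout the paper proposes. The function names `pack_2bit`, `unpack_to_int8`, and `gemv_2bit` are hypothetical.

```python
def pack_2bit(weights):
    """Pack ternary weights into bytes, four 2-bit codes per byte.

    Assumed layout (illustrative only): code = w + 1, first weight in the low bits.
    """
    assert len(weights) % 4 == 0, "row length must be a multiple of 4"
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            assert w in (-1, 0, 1)
            byte |= (w + 1) << (2 * j)
        packed.append(byte)
    return bytes(packed)


def unpack_to_int8(packed):
    """Up-convert packed 2-bit codes back to small signed integers (int8 range)."""
    out = []
    for byte in packed:
        for j in range(4):
            out.append(((byte >> (2 * j)) & 0b11) - 1)
    return out


def gemv_2bit(packed_rows, x_int8, scale):
    """Compute y = scale * (W @ x) with W stored as packed 2-bit rows.

    Weights are up-converted on the fly and multiplied against int8
    activations using integer accumulation; the scale is applied at the end.
    """
    result = []
    for row in packed_rows:
        w = unpack_to_int8(row)
        acc = sum(wi * xi for wi, xi in zip(w, x_int8))  # integer accumulate
        result.append(acc * scale)
    return result


if __name__ == "__main__":
    W = [[1, -1, 0, 1], [0, 0, -1, -1]]
    x = [3, -2, 5, 7]
    rows = [pack_2bit(r) for r in W]
    print(gemv_2bit(rows, x, 0.5))
```

An optimized kernel would perform the unpacking with vector shifts and masks and feed the result to int8 dot-product instructions; the sketch only shows the data flow.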
Abstract
The advent of ultra-low-bit LLMs (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. While these quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, the computational efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp) used to deploy them remains underexplored. In this work, we take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into a state-of-the-art LLM inference framework, namely PyTorch-TPP, and present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet.cpp by up to 2.2x and deliver up to 7x speedup compared to 16-bit model inference. We then extend this work to Intel GPUs, where we design and implement mixed-precision 2-bit GEMM kernels and show their performance to be close to optimal. We integrate our optimized Xe2 kernels into the vLLM framework as a quantization plugin and evaluate end-to-end inference for a range of LLMs and Xe2 GPUs. Depending on the model and platform, we see a 4x to 8x reduction in GEMM time and up to 6.3x speedup in end-to-end latency compared to BF16 execution. Our optimized runtime advances the state of LLM inference on AI PCs and Intel Xe GPUs, paving the way for efficient deployment of ultra-low-bit LLMs.
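The upper end of the reported 4x to 8x GEMM-time reduction is consistent with a simple roofline argument: when a GEMM is bound by streaming weights from memory, moving from 16-bit to 2-bit weights cuts the dominant traffic by a factor of 8. The sketch below illustrates this under stated assumptions. The peak throughput and bandwidth figures are hypothetical placeholders, not measurements from the paper, and `gemm_time_s` is an illustrative helper, not the paper's performance model.

```python
def gemm_time_s(m, k, n, weight_bits, act_bytes, peak_ops_per_s, bandwidth_bytes_per_s):
    """Roofline estimate: time is the larger of compute time and memory time.

    Weights are an m x k matrix, activations k x n, outputs m x n.
    """
    ops = 2 * m * k * n  # one multiply and one add per weight per column
    bytes_moved = m * k * weight_bits / 8 + k * n * act_bytes + m * n * act_bytes
    return max(ops / peak_ops_per_s, bytes_moved / bandwidth_bytes_per_s)


# Hypothetical machine: 10 Tops/s peak, 100 GB/s memory bandwidth (assumed values).
PEAK = 10e12
BANDWIDTH = 100e9

# Single-token decoding (n = 1) on a 4096 x 4096 weight matrix.
t_bf16 = gemm_time_s(4096, 4096, 1, weight_bits=16, act_bytes=2,
                     peak_ops_per_s=PEAK, bandwidth_bytes_per_s=BANDWIDTH)
t_2bit = gemm_time_s(4096, 4096, 1, weight_bits=2, act_bytes=2,
                     peak_ops_per_s=PEAK, bandwidth_bytes_per_s=BANDWIDTH)
speedup = t_bf16 / t_2bit

if __name__ == "__main__":
    print(f"estimated speedup: {speedup:.2f}x")
```

In this model the speedup stays just below 8x because activation and output traffic does not shrink; larger batch sizes or compute-bound shapes would reduce it further.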
