Pushing the Envelope of LLM Inference on AI-PC and Intel GPUs

Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke

TL;DR

The paper tackles the bottleneck of deploying ultra-low-bit LLMs on resource-constrained hardware by designing 1-bit and 2-bit GEMM microkernels for CPUs with an up-convert-to-int8 path and a novel 2-bit weight layout, then extending these to Intel Xe2 GPUs with a mixed-precision GEMM design that fuses activation quantization into the kernel. The kernels are integrated into PyTorch-TPP and vLLM, enabling end-to-end inference with up to $7\times$ speedup over 16-bit baselines on CPUs and up to $6.3\times$ lower end-to-end latency on Xe2 GPUs, along with substantial improvements over prior runtimes such as bitnet.cpp. The work pairs these measurements with roofline-based models that explain the observed performance, and argues that ultra-low-bit inference on AI-PCs and discrete client GPUs can approach high-end GPU performance. The results support efficient deployment of ultra-low-bit LLMs in edge/AI-PC contexts and point future work toward ARM/SVE implementations and broader platform support.

Abstract

The advent of ultra-low-bit LLMs (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts at the same model size, is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. While these quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, the computational efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp) used to deploy them remains underexplored. In this work, we take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into a state-of-the-art LLM inference framework, namely PyTorch-TPP, and present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet.cpp by up to $2.2\times$ and deliver up to $7\times$ speedup compared to 16-bit model inference. We then extend this work to Intel GPUs, where we design and implement mixed-precision 2-bit GEMM kernels and show their performance to be close to optimal. We integrate our optimized Xe2 kernels into the vLLM framework as a quantization plugin and evaluate end-to-end LLM inference for a range of LLMs and Xe2 GPUs. Depending on the model and platform, we see a $4\times$ to $8\times$ reduction in GEMM time compared to the BF16 case, and up to $6.3\times$ speedup in end-to-end latency compared to BF16 execution. Our optimized runtime advances the state of LLM inference on AI PCs and Intel Xe GPUs, paving the way for efficient deployment of ultra-low-bit LLMs.
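The reported GEMM-time reductions can be put in context with a back-of-the-envelope bandwidth bound. The sketch below is not the paper's roofline model. It assumes a purely memory-bound GEMV in which streaming the weight matrix dominates, and it ignores activations, quantization scales, and compute. The function names and the 100 GB/s bandwidth figure are hypothetical, chosen only for illustration.

```python
def weight_bytes(m, k, bits):
    """Bytes needed to store an m x k weight matrix at `bits` bits per entry."""
    return m * k * bits / 8


def gemv_time_lower_bound(m, k, bits, bandwidth_gbs):
    """Seconds to stream the weights once at `bandwidth_gbs` GB/s.

    Assumption: weight traffic is the only cost (memory-bound GEMV).
    """
    return weight_bytes(m, k, bits) / (bandwidth_gbs * 1e9)


def ideal_speedup_vs_16bit(bits):
    """Upper bound on speedup over 16-bit weights under the same assumption."""
    return 16 / bits


if __name__ == "__main__":
    m = k = 4096          # hypothetical layer shape
    bw = 100.0            # hypothetical memory bandwidth in GB/s
    for bits in (16, 2, 1):
        t = gemv_time_lower_bound(m, k, bits, bw)
        print(f"{bits:>2}-bit: {t * 1e6:8.1f} us, "
              f"ideal speedup {ideal_speedup_vs_16bit(bits):.0f}x")
```

Under this simplified bound, 2-bit weights move one eighth of the bytes of 16-bit weights, so an $8\times$ reduction is the ceiling for a memory-bound kernel. This is consistent with the upper end of the $4\times$ to $8\times$ range quoted in the abstract.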

Paper Structure

This paper contains 21 sections, 4 equations, 17 figures.

Figures (17)

  • Figure 1: Left: Unpacking the int2 VNNI4-interleaved format to int8 VNNI4. We pack a $[k4][m4]$ 2-bit subtensor into a 32-bit value (see the dotted sub-tensor at the top), and a full 256-bit vector load reads 128 2-bit entries, which effectively form an $[M8][k4][m4]$ tensor. With 1 logical shift, 2 logical ANDs, and 4 byte-shuffles we obtain 4 256-bit vectors, each holding an $[m8][k4]$ int8 subtensor in VNNI4 layout. Right: AVX2 GEMM microkernel with int2 weights (matrix $A^{M\times K}$), int8 activations (matrix $B^{N\times K}$), and vnni-INT8 FMAs ($M=32$, $N=1$, $K=4$). Matrix $A$ uses the VNNI4-interleaved layout $[M8][k4][m4]$ shown on the left.
  • Figure 2: AVX2 GEMM microkernel with int1 weights and vnni-INT8 FMAs ($M=32$, $N=1$, $K=4$). Matrix $A$ uses the conventional VNNI4 layout $[m8][k4]$.
  • Figure 3: Xe2 GPU GEMM kernel with int2 weights ($A$) and BF16 activations ($B$ and $C$). We split the output matrix $C^{M\times N}$ into tiles of size $wg\_tile\_m\times wg\_tile\_n$, and each workgroup computes one such sub-matrix (see the yellow box on the left). This sub-matrix is further divided into tiles of size $sg\_tile\_m\times sg\_tile\_n$ (see the dark-orange rectangular tile), which are assigned to subgroups. Finally, the subgroup GEMM microkernel and its tile operations are mapped to actual Xe2 instructions, such as 2D-load and DPAS instructions (see the GEMM microkernel on the right). The quantization of $B$ is fused into the GEMM: the loaded $B$ sub-matrix is in BF16 (dark-blue box), and we quantize its entries to int8 using in-register operations (light-blue assembly box). Rectangular $sg\_tile\_m\times sg\_tile\_n$ tiles let us reuse the quantized $B$ and amortize the quantization overhead.
  • Figure 4: Effective bandwidth rooflines for 2-bit and 1-bit GEMV microkernels considering $P$ and $E$ cores.
  • Figure 5: Attained GEMV bandwidth on ARL for various matrix shapes and precisions (int/int2/int1).
  • ...and 12 more figures
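
The packing described in the Figure 1 caption can be illustrated with a small NumPy emulation. This is a scalar sketch of the layout semantics only, not the authors' AVX2 kernel and not the shift/AND/byte-shuffle sequence the caption counts. It assumes unsigned 2-bit codes in the range 0 to 3 and assumes that, within each 32-bit word, the $m4$ index is innermost so that entry $(k, m)$ sits at bit offset $2(4k + m)$. The function names `pack_int2_k4m4` and `unpack_to_vnni4` are hypothetical.

```python
import numpy as np


def pack_int2_k4m4(w):
    """Pack an [M8][k4][m4] array of 2-bit codes into 8 uint32 words.

    The 8 words together occupy 256 bits, i.e. one AVX2 vector register.
    Bit order inside a word is an assumption: m4 is the innermost index.
    """
    w = np.asarray(w, dtype=np.uint32)
    assert w.shape == (8, 4, 4) and w.max() <= 3
    k = np.arange(4, dtype=np.uint32)[:, None]
    m = np.arange(4, dtype=np.uint32)[None, :]
    shifts = np.uint32(2) * (np.uint32(4) * k + m)   # bit offset of (k, m)
    return np.bitwise_or.reduce((w << shifts).reshape(8, 16), axis=1)


def unpack_to_vnni4(words):
    """Expand 8 packed words into 4 arrays of shape [m8][k4] (uint8).

    Output j holds, for each of the 8 lanes, the four k-values whose
    m4 index equals j, one value per byte (a VNNI4-style layout).
    """
    words = np.asarray(words, dtype=np.uint32)
    byte_shifts = np.uint32(8) * np.arange(4, dtype=np.uint32)
    outputs = []
    for j in range(4):
        # After the shift and mask, byte k of each lane holds code (k, j).
        lanes = (words >> np.uint32(2 * j)) & np.uint32(0x03030303)
        # Split each 32-bit lane into its 4 bytes explicitly (endian-safe).
        unpacked = (lanes[:, None] >> byte_shifts) & np.uint32(0xFF)
        outputs.append(unpacked.astype(np.uint8))
    return outputs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codes = rng.integers(0, 4, size=(8, 4, 4))
    packed = pack_int2_k4m4(codes)
    vectors = unpack_to_vnni4(packed)
    print(packed.nbytes * 8, "bits packed ->", len(vectors), "int8 vectors")
```

Each of the four outputs is 32 bytes, matching the four 256-bit int8 vectors in the caption. A real kernel would perform the expansion on whole registers and would also map the codes to signed weight values, which this sketch leaves out.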