Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs

Dibakar Gope, David Mansell, Danny Loh, Ian Bratt

TL;DR

This work addresses the challenge of efficient LLM inference on Arm CPUs by delivering highly optimized GEMV and GEMM kernels tailored for low-bit quantization and by introducing a group-wise non-uniform codebook quantization strategy. The kernel design emphasizes SIMD-aware weight packing, fast dequantization fused with computation, and assembly-level optimizations to maximize MAC utilization on Arm architectures, alongside a quantization scheme where a small set of codebooks per weight group preserves accuracy at ultra-low bits. Empirically, the approach yields substantial throughput gains, achieving approximately 3–3.2× faster prompt processing and around 2× faster autoregressive decoding for 4-bit models compared with a LLaMA.cpp baseline, with 2-bit schemes showing favorable compute/memory trade-offs. The proposed non-uniform codebook quantization further improves text generation quality (perplexity) while maintaining competitive throughput, demonstrating a Pareto-optimal balance between model size, speed, and accuracy on commodity Arm CPUs; the kernels and quantization methods are available at the project repository.

Abstract

Large language models (LLMs) have transformed the way we think about language understanding and generation, enthralling both researchers and developers. However, deploying LLMs for inference has been a significant challenge due to their unprecedented size and resource requirements. While quantizing model weights to sub-byte precision has emerged as a promising solution to ease memory pressure, the group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. As a result, a higher proportion of compute instructions do not perform multiplies, i.e., real work, rendering these formats unsuitable for meeting the latency requirements of LLMs deployed on commodity CPUs. In this work, we propose a set of highly optimized kernels to accelerate LLM inference and unleash the full potential of CPUs, particularly Arm CPUs. These kernels amortize the cost of loading the operands and the cost of weight unpacking across multiple output rows. Together with an optimized interleaved group data layout for weights and decompression path optimizations, which reduce unnecessary operations and dequantization overhead while maximizing the use of vector and matrix multiply operations, this significantly improves the efficiency of MAC operations. Furthermore, we present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions, demonstrating better throughput during token generation while ensuring better quality than the state-of-the-art. Applying these improvements to 4-bit LLMs results in a 3-3.2x improvement in prompt processing and a 2x improvement in autoregressive decoding on Arm CPUs, compared to a LLaMA.cpp-based solution. The optimized kernels are available at https://github.com/ggerganov/llama.cpp.
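To make the group quantization format and its dequantization step concrete, here is a minimal Python sketch. It assumes a group size of 32 and FP16 scale factors, as in the Figure 1 caption. The symmetric `max_abs / 7` scaling, the two's-complement nibble encoding, the low-nibble-first packing, and all function names are assumptions made for illustration. They are not the paper's kernels or the actual LLaMA.cpp storage format.

```python
import math
import struct

GROUP_SIZE = 32  # V = 32, as in the Figure 1 caption


def to_fp16(x):
    """Round a Python float to the nearest FP16 value (scales are kept in FP16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_groupwise(weights, group_size=GROUP_SIZE):
    """Return (scales, packed): one scale per group, two 4-bit values per byte."""
    assert len(weights) % group_size == 0 and group_size % 2 == 0
    scales, packed = [], bytearray()
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Assumed symmetric scheme: map the largest magnitude to +/-7.
        scale = to_fp16(max_abs / 7.0) or 1.0  # fall back to 1.0 for an all-zero group
        scales.append(scale)
        q = [max(-8, min(7, round(w / scale))) for w in group]
        # Assumed layout: even element in the low nibble, odd element in the high nibble.
        for lo, hi in zip(q[0::2], q[1::2]):
            packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return scales, bytes(packed)


def unpack_nibble(n):
    """Sign-extend a 4-bit two's-complement value (0..15) to a signed integer (-8..7)."""
    return (n ^ 8) - 8


def dequantize_groupwise(scales, packed, group_size=GROUP_SIZE):
    """Unpack every nibble and multiply it by the scale of the group it belongs to."""
    out = []
    for i, byte in enumerate(packed):
        scale = scales[(2 * i) // group_size]
        out.append(unpack_nibble(byte & 0xF) * scale)
        out.append(unpack_nibble(byte >> 4) * scale)
    return out


weights = [0.1 * math.sin(i) for i in range(64)]  # two groups of 32 toy weights
scales, packed = quantize_groupwise(weights)
restored = dequantize_groupwise(scales, packed)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(len(packed), len(scales), max_err)
```

Every weight in this sketch costs a mask or shift, a sign extension, and a scale multiply before it can take part in a multiply-accumulate. This is the kind of per-element unpacking overhead the abstract refers to when it says many instructions do not perform multiplies.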

Paper Structure (20 sections, 7 figures, 4 tables, 2 algorithms)

Figures (7)

  • Figure 1: Group-wise quantization, in which weights are divided into groups, each with V elements and its own scale factor. We use a group size (V) of $32$ here. Given a weight tensor, a group of $32$ floating-point weights is quantized into $4$-bit integer values using a local scale factor. The next set of $32$ consecutive weights is then quantized to $4$ bits using a different scale factor, and this process is repeated until the entire weight tensor is covered. We use FP16 precision for scale factors.
  • Figure 2: Group processing steps in a reference baseline group-wise quantized dot product kernel.
  • Figure 3: SIMD-aware weight reorder to minimize scalar operations in GEMV and GEMM kernels.
  • Figure 4: Fast decompression path for unpacking 4-bit nibbles into signed 8-bit weights in GEMV and GEMM kernels.
  • Figure 5: Fine-grained assignment of codebooks to various groups in group-wise codebook-based quantization. Each group finds the closest codebook of the $C$ codebooks ($C_1$, $C_2$, ..., $C_m$) that best represents its values and quantizes its high-precision values to the codebook centroids using $2$-bit indices.
  • ...and 2 more figures
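
The per-group codebook assignment described in the Figure 5 caption can be sketched as follows. Each group tries every candidate codebook, snaps its values to the nearest centroid, and keeps the codebook with the lowest squared error. With 2-bit indices, each codebook holds four centroids. The centroid values, the toy group size of 4, and the function names are hypothetical and chosen only to keep the example short. How the codebooks themselves are learned is not covered here.

```python
def quantize_group(group, codebook):
    """Snap each value to its nearest centroid; return (indices, squared error)."""
    indices = [
        min(range(len(codebook)), key=lambda k: (w - codebook[k]) ** 2)
        for w in group
    ]
    error = sum((w - codebook[i]) ** 2 for w, i in zip(group, indices))
    return indices, error


def assign_codebooks(weights, codebooks, group_size):
    """For each group, pick the codebook with the smallest quantization error."""
    assert len(weights) % group_size == 0
    assignments = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        best_id, best_indices, best_error = None, None, float("inf")
        for cb_id, codebook in enumerate(codebooks):
            indices, error = quantize_group(group, codebook)
            if error < best_error:
                best_id, best_indices, best_error = cb_id, indices, error
        assignments.append((best_id, best_indices))
    return assignments


# Hypothetical codebooks: four centroids each, so every index fits in 2 bits.
codebooks = [
    [-0.3, -0.1, 0.1, 0.3],   # suits groups with a narrow value range
    [-1.0, -0.2, 0.2, 1.0],   # suits groups with outliers
]
weights = [0.28, -0.12, 0.09, -0.31,   # group 0: narrow range
           0.95, -0.18, 0.22, -1.05]   # group 1: contains outliers
assignments = assign_codebooks(weights, codebooks, group_size=4)
print(assignments)
```

In this toy input the narrow-range group selects the first codebook and the group with outliers selects the second, which illustrates why a fine-grained, per-group choice can track non-uniform weight distributions better than a single shared grid.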