SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs
Ahmed F. AbouElhamayed, Jordan Dotzel, Yash Akhauri, Chi-Chih Chang, Sameh Gobriel, J. Pablo Muñoz, Vui Seng Chua, Nilesh Jain, Mohamed S. Abdelfattah
TL;DR
SparAMX accelerates LLM decoding on CPUs by combining unstructured sparsity with Intel AMX on Sapphire Rapids. The authors implement dense and sparse GEMM kernels, plus INT8 variants, that reduce memory transfers during memory-bound decoding while maintaining accuracy, and they extend sparsity to the KV cache within attention. Key results include up to $1.42\times$ end-to-end speedup over stock PyTorch, up to $1.46\times$ gains for INT8 versus DeepSparse, and $1.14\times$ speedup from KV-cache sparsity with minimal accuracy loss at long contexts. The approach is general-purpose and open-source, and it demonstrates practical potential for CPU-based LLM deployment, especially in memory-bound decoding scenarios where GPUs are unavailable or cost and power budgets are tight.
Abstract
Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with CPUs enables broader AI access at lower cost and power consumption. This acceleration potential for CPUs is especially relevant during the memory-bound decoding stage of LLM inference, which processes one token at a time and is increasingly exercised by reasoning models. We utilize Advanced Matrix Extensions (AMX) support on the latest Intel CPUs together with unstructured sparsity to achieve a $1.42 \times$ reduction in end-to-end latency compared to the current PyTorch implementation by applying our technique in linear layers. We provide a set of open-source customized sparse kernels that can speed up any PyTorch model by automatically replacing all linear layers with our custom sparse implementation. Furthermore, we demonstrate for the first time the use of unstructured sparsity in the attention computation, achieving a $1.14 \times$ speedup over current systems without compromising accuracy. Code: https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SparAMX
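To make the memory-transfer argument concrete, the following is a minimal sketch of how unstructured sparsity can shrink the bytes read per weight row during decoding. It assumes a bitmask-plus-packed-values layout and 2-byte weights; both are illustrative assumptions rather than details stated above, and the function names (`compress_row`, `sparse_dot`, `compressed_bytes`) are hypothetical and not taken from the SparAMX code.

```python
from typing import List, Tuple


def compress_row(row: List[float]) -> Tuple[int, List[float]]:
    """Pack a row into (bitmask, nonzero values); bit i is set when row[i] != 0."""
    mask = 0
    values = []
    for i, w in enumerate(row):
        if w != 0.0:
            mask |= 1 << i
            values.append(w)
    return mask, values


def sparse_dot(mask: int, values: List[float], x: List[float]) -> float:
    """Dot product of a compressed row with a dense activation vector x."""
    total = 0.0
    k = 0  # index into the packed nonzero values
    for i in range(len(x)):
        if (mask >> i) & 1:
            total += values[k] * x[i]
            k += 1
    return total


def compressed_bytes(n_cols: int, n_nonzero: int, bytes_per_weight: int = 2) -> int:
    """Bytes for one row: 1 bit per column for the mask plus the packed values.

    bytes_per_weight=2 is an assumed 16-bit weight format.
    """
    return (n_cols + 7) // 8 + n_nonzero * bytes_per_weight


# Toy row with 3 nonzeros out of 8 weights (62.5% sparse).
row = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.5]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

mask, values = compress_row(row)
dense_result = sum(w * a for w, a in zip(row, x))
sparse_result = sparse_dot(mask, values, x)
dense_size = len(row) * 2
sparse_size = compressed_bytes(len(row), len(values))
```

In this toy case the compressed row occupies 7 bytes instead of 16 while producing the same dot product. Because decoding is memory-bound, fewer bytes loaded per token is the lever for lower latency; a real kernel would expand the packed values into AMX tiles rather than loop in Python.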
