NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava
TL;DR
This work addresses a bottleneck of large language model inference on CPUs: attention requires expensive multiply-add (MAD) operations to compute dot products between every query and key. It introduces NoMAD-Attention, a MAD-free approach that replaces these dot products with in-register lookups built on Product Quantization (PQ), 8-bit lookup tables (LUTs), and a re-engineered key-cache layout that supports batched SIMD operations. By learning per-head codebooks and reorganizing data layouts to make full use of SIMD shuffle instructions, NoMAD-Attention preserves model quality while delivering speedups of up to 2× on a 4-bit quantized LLaMA-7B at 16k context length, without finetuning. The method is validated on CPUs with AVX2, demonstrating practical, reproducible improvements that could broaden access to LLMs on commodity devices.
Abstract
Large language model inference on Central Processing Units (CPUs) is challenging due to the vast quantities of expensive Multiply-Add (MAD) matrix operations in the attention computations. In this paper, we argue that there is a rare gem in modern CPUs, Single-Instruction-Multiple-Data (SIMD) registers, which allow for ultra-low-latency lookups in batch. We leverage this unique capability of CPUs to propose NoMAD-Attention, an efficient attention algorithm that replaces MAD operations with in-register lookups. Through hardware-aware algorithmic designs, NoMAD-Attention computes attention scores using repeated fast accesses to SIMD registers despite their highly limited sizes. Moreover, NoMAD-Attention works with pre-trained attention-based LLMs without model finetuning. Empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well, and speeds up the 4-bit quantized LLaMA-7B-based model by up to 2$\times$ at 16k context length. Our results are reproducible at https://github.com/tonyzhang617/nomad-dist.
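To make the idea of lookup-based attention scores concrete, the following is a minimal NumPy sketch of product-quantized scoring with an 8-bit lookup table. It is an illustration under stated assumptions, not the authors' implementation: the function names (`encode_keys`, `build_lut`, `quantize_lut`, `approx_scores`), the min-max quantization scheme, and the choice of 16 centroids per sub-space (so that a table of 16 one-byte entries would fit in 128 bits) are all hypothetical, and NumPy fancy indexing stands in for the in-register SIMD shuffles that the paper relies on.

```python
import numpy as np

def encode_keys(keys, codebooks):
    """Map each key to one centroid index per sub-space.

    keys:      (n_keys, d) float array
    codebooks: (n_sub, n_centroids, d_sub) float array, with n_sub * d_sub == d
    returns:   (n_keys, n_sub) uint8 array of centroid indices
    """
    n_sub = codebooks.shape[0]
    sub = keys.reshape(keys.shape[0], n_sub, -1)            # (n, S, d_sub)
    # Squared distance from every sub-vector to every centroid: (n, S, C)
    dists = ((sub[:, :, None, :] - codebooks[None]) ** 2).sum(-1)
    return dists.argmin(-1).astype(np.uint8)

def build_lut(query, codebooks):
    """Dot product of each query sub-vector with each centroid: (S, C).

    This is the only step that multiplies, and its cost per query does not
    depend on the number of cached keys.
    """
    q_sub = query.reshape(codebooks.shape[0], -1)           # (S, d_sub)
    return np.einsum("sd,scd->sc", q_sub, codebooks)

def quantize_lut(lut):
    """Min-max quantize the table to 8 bits (an assumed, simple scheme)."""
    lo, hi = lut.min(), lut.max()
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    lut_q = np.round((lut - lo) / scale).astype(np.uint8)
    return lut_q, scale, lo

def approx_scores(codes, lut_q, scale, lo):
    """Approximate q . k for every key using lookups and integer adds."""
    n_sub = lut_q.shape[0]
    # Gather lut_q[s, codes[n, s]] for every key n and sub-space s.
    gathered = lut_q[np.arange(n_sub)[None, :], codes]      # (n, S)
    acc = gathered.astype(np.uint16).sum(axis=1)            # no multiplies here
    # One rescale per key undoes the 8-bit quantization.
    return acc * scale + n_sub * lo
```

In this sketch the per-key work is a gather followed by integer additions; the dot products are paid once per query when the table is built. How closely the result tracks the exact scores depends on how well the codebooks fit the keys and on the 8-bit rounding of the table.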
