
NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention

Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava

TL;DR

This work tackles the bottleneck of large language model inference on CPUs, where attention requires expensive Multiply-Add (MAD) operations to compute all-pair dot products. It introduces NoMAD-Attention, a MAD-free approach that replaces dot-product computations with in-register lookups built on Product Quantization (PQ), 8-bit lookup tables (LUTs), and a reengineered key-cache layout that supports batched SIMD operations. By learning codebooks per layer and per head and reorganizing data layouts to make full use of SIMD shuffle instructions, NoMAD-Attention preserves model quality while delivering substantial speedups, notably up to 2× for a 4-bit quantized LLaMA-7B-based model at 16k context length, without finetuning. The methods are validated on CPU hardware with AVX2, demonstrating practical, reproducible improvements that could broaden access to LLMs on commodity devices.

Abstract

Large language model inference on Central Processing Units (CPUs) is challenging due to the vast quantities of expensive Multiply-Add (MAD) matrix operations in the attention computations. In this paper, we argue that there is a rare gem in modern CPUs, Single-Instruction-Multiple-Data (SIMD) registers, which allow for ultra-low-latency lookups in batch. We leverage this unique capability of CPUs to propose NoMAD-Attention, an efficient attention algorithm that replaces MAD operations with in-register lookups. Through hardware-aware algorithmic designs, NoMAD-Attention computes attention scores using repeated fast accesses to SIMD registers despite their highly limited sizes. Moreover, NoMAD-Attention works with pre-trained attention-based LLMs without model finetuning. Empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well, and speeds up the 4-bit quantized LLaMA-7B-based model by up to 2$\times$ at 16k context length. Our results are reproducible at https://github.com/tonyzhang617/nomad-dist.
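To make the idea of replacing dot products with lookups concrete, the following is a minimal pure-Python sketch of product-quantized attention scoring. It is not the authors' implementation: the real method performs the lookups with SIMD shuffle instructions on registers, whereas plain Python lists stand in for registers here. All function names (`encode_key`, `build_luts`, `quantize_luts`, `lookup_score`, `lookup_score_8bit`), the toy codebook, and the particular 8-bit quantization scheme (one minimum per table, one shared scale) are assumptions chosen for illustration.

```python
def split(vec, d_sub):
    """Cut a vector into consecutive sub-vectors of length d_sub."""
    return [vec[i:i + d_sub] for i in range(0, len(vec), d_sub)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode_key(key, codebooks, d_sub):
    """Replace each key sub-vector by the index of its nearest centroid."""
    codes = []
    for sub, centroids in zip(split(key, d_sub), codebooks):
        dists = [sum((x - c) ** 2 for x, c in zip(sub, cen)) for cen in centroids]
        codes.append(dists.index(min(dists)))
    return codes


def build_luts(query, codebooks, d_sub):
    """One table per sub-space: query sub-vector dotted with every centroid."""
    return [[dot(sub, cen) for cen in centroids]
            for sub, centroids in zip(split(query, d_sub), codebooks)]


def quantize_luts(luts):
    """Quantize table entries to 0..255 (assumed scheme: per-table min, shared scale)."""
    mins = [min(lut) for lut in luts]
    span = max(max(lut) - m for lut, m in zip(luts, mins))
    scale = span / 255 if span > 0 else 1.0
    qluts = [[min(255, max(0, round((v - m) / scale))) for v in lut]
             for lut, m in zip(luts, mins)]
    return qluts, scale, sum(mins)


def lookup_score(luts, codes):
    """Approximate q . k using only table lookups and additions."""
    return sum(lut[c] for lut, c in zip(luts, codes))


def lookup_score_8bit(qluts, scale, bias, codes):
    """Same as lookup_score, but accumulating 8-bit table entries."""
    return scale * sum(lut[c] for lut, c in zip(qluts, codes)) + bias


# Toy setup: 4-dimensional keys, d_sub = 1, 16 centroids per sub-space (4-bit codes).
D_SUB = 1
centroids = [[(i - 8) * 0.25] for i in range(16)]
codebooks = [centroids] * 4  # shared here for brevity; the paper learns them per layer and head

key = [0.5, -1.0, 1.75, 0.0]
query = [1.0, 2.0, -1.0, 0.5]

codes = encode_key(key, codebooks, D_SUB)    # done once, when the key is cached
luts = build_luts(query, codebooks, D_SUB)   # done once per query
qluts, scale, bias = quantize_luts(luts)

exact = dot(query, key)
approx = lookup_score(luts, codes)
approx_8bit = lookup_score_8bit(qluts, scale, bias, codes)
print(exact, approx, approx_8bit)
```

In this toy case the key lies exactly on the centroids, so the floating-point lookup score matches the true dot product, and the 8-bit variant differs only by table rounding error. With real key embeddings the codes are approximations, which is why codebook quality matters for model quality.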

Paper Structure (25 sections, 7 equations, 8 figures, 3 algorithms)

This paper contains 25 sections, 7 equations, 8 figures, 3 algorithms.

Figures (8)

  • Figure 1: An illustrative comparison of memory layouts of the key cache of LLM attention and the key-code cache of NoMAD-Attention, and an illustration of how attention scores are computed through in-register lookups in NoMAD.
  • Figure 2: Value distributions of attention key embeddings of the LLaMA-2-7B model on samples of the WikiText-2 dataset. The first 4 attention heads in 4 different layers are shown, and all 128 dimensions of the key embeddings are used. Key embeddings have different distributions in value across different layers and heads, making it necessary for codebooks to be learned independently for each layer and head to minimize quantization error.
  • Figure 3: NoMAD-Attention-based LLMs maintain model quality with negligible degradation in perplexity compared to the original model at $8\times$ key cache compression / 4 bits per float in key / $d_\mathrm{sub}=1$. Dimensionality reduction-based PCA-Attention leads to significant model quality degradation even at $2\times$ key cache compression.
  • Figure 4: The efficiency of Attention-based and NoMAD-Attention-based CodeLlama-7B models on prompt processing and decoding. NoMAD-Attention-based models achieve significant speedup over Attention-based counterparts. At the context length of 16k, NoMAD-Attention-based CodeLlama-7B (4-bit weights) achieves $2\times$ speedup over the original CodeLlama-7B (4-bit weights).
  • Figure 5: The latency per query of Attention, PQ-Attention (8-bit code, $d_\mathrm{sub}=2$), and NoMAD-Attention (4-bit code, $d_\mathrm{sub}=1$) in computing attention scores and key caching for 16k queries at 16k context length. PQ-Attention yields limited speedup compared to Attention and incurs the most overhead in key caching due to the large size of its codebooks. NoMAD-Attention significantly reduces the latency of attention score computations compared to Attention.
  • ...and 3 more figures
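
The compression figures quoted in the captions above follow from simple arithmetic, sketched below. The helper names and the 32-bit baseline are assumptions: the baseline is inferred from the caption's pairing of "$8\times$ key cache compression" with "4 bits per float", and codebook storage is ignored.

```python
def bits_per_float(code_bits, d_sub):
    """Bits spent per key dimension when one code covers d_sub dimensions."""
    return code_bits / d_sub


def compression_ratio(code_bits, d_sub, baseline_bits=32):
    """Key cache compression relative to an assumed 32-bit float baseline."""
    return baseline_bits / bits_per_float(code_bits, d_sub)


# NoMAD-Attention setting from the captions: 4-bit code, d_sub = 1.
nomad_bits = bits_per_float(4, 1)
nomad_ratio = compression_ratio(4, 1)

# PQ-Attention setting from the Figure 5 caption: 8-bit code, d_sub = 2.
pq_bits = bits_per_float(8, 2)

print(nomad_bits, nomad_ratio, pq_bits)
```

Under these assumptions both settings spend the same number of bits per key dimension; they differ in codebook size (16 versus 256 centroids per sub-space), which is consistent with the caption attributing PQ-Attention's overhead to its large codebooks.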