SAIL: SRAM-Accelerated LLM Inference System with Lookup-Table-based GEMV

Jingyao Zhang, Jaewoo Park, Jongeun Lee, Elaheh Sadredini

TL;DR

SAIL tackles the memory-bound challenge of LLM inference on CPUs by introducing LUT-based GEMV computed in SRAM near the cache (C-SRAM), enabling arbitrary-precision operations with minimal hardware overhead. It combines tensor-level scheduling, pattern-aware LUT optimization, and in-memory type conversion, plus a dedicated ISA extension for tiled GEMV, to support quantized LLMs efficiently. Experimental results show up to 10.7x speedups over CPU baselines and up to 7.04x cost-efficiency improvements versus GPU servers, with robust performance across quantization levels and batch sizes. This work offers a practical path to high-throughput, cost-effective CPU-based LLM inference, potentially broadening access to large-scale models while minimizing data movement and energy use.

Abstract

Large Language Model (LLM) inference requires substantial computational resources, yet CPU-based inference remains essential for democratizing AI due to the widespread availability of CPUs compared to specialized accelerators. However, efficient LLM inference on CPUs faces two fundamental challenges: (1) existing CPU architectures struggle with the low-precision arithmetic required by quantized models, where the optimal bit precision varies across models and layers; and (2) the memory-bound nature of the token generation phase creates severe performance bottlenecks. To address these challenges, we propose SAIL (SRAM-Accelerated Inference of LLMs), a CPU-based inference solution that efficiently supports arbitrary bit precisions with minimal overhead. SAIL integrates three key innovations. First, we introduce Batched LUT-based General Matrix-Vector Multiplication (LUT-GEMV) with SRAM-based processing-in-memory (PIM), enabling high data reuse through lookup tables and reducing memory movement. Second, our Pattern-Aware LUT optimization identifies and exploits redundancy in input activation patterns, reducing computation cycles by 13.8%. Third, we develop an in-memory type conversion algorithm that leverages PIM's parallelism for efficient de-/quantization operations, alleviating pressure on the CPU's vector units. Our architecture requires only 2% hardware overhead and a single new instruction, while maintaining dual functionality as both compute and storage units. Experimental evaluations using a modified gem5 simulator demonstrate that SAIL achieves up to 10.7x speedup and 19.9x higher tokens per dollar compared to ARM Neoverse-N1 CPU baselines, and up to 7.04x better cost efficiency than NVIDIA V100 GPUs, establishing a practical path for efficient CPU-based LLM inference.
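
The batched LUT-GEMV idea in the abstract can be illustrated with a small software sketch. It is not the paper's hardware implementation. It assumes one common formulation: each group of NBW weights is expanded into a table of all its subset sums, and the bits of the input activations, taken one bit plane at a time, form the index into that table. The function names, the unsigned 4-bit activations, and the group size of 3 are illustrative assumptions.

```python
def build_lut(weights):
    """Return all subset sums of one weight group (2**len(weights) entries).

    Entry idx holds the sum of weights[j] for every bit j that is set in idx.
    """
    n = len(weights)
    lut = [0] * (1 << n)
    for idx in range(1, 1 << n):
        low = idx & -idx                # lowest set bit of idx
        j = low.bit_length() - 1        # position of that bit
        lut[idx] = lut[idx ^ low] + weights[j]
    return lut


def lut_gemv_batched(matrix, batch, act_bits=4, nbw=3):
    """Compute matrix @ x for every x in batch using table lookups only.

    Assumes integer weights and unsigned activations below 2**act_bits.
    """
    out = [[0] * len(matrix) for _ in batch]
    for r, row in enumerate(matrix):
        for start in range(0, len(row), nbw):
            # The table is built once per weight group ...
            lut = build_lut(row[start:start + nbw])
            # ... and reused by every input vector in the batch.
            for b, acts in enumerate(batch):
                group = acts[start:start + nbw]
                for bit in range(act_bits):
                    # Gather bit `bit` of each activation into a table index.
                    idx = 0
                    for j, a in enumerate(group):
                        idx |= ((a >> bit) & 1) << j
                    # Weight the looked-up sum by the bit's significance.
                    out[b][r] += lut[idx] * (1 << bit)
    return out


if __name__ == "__main__":
    W = [[1, -2, 3, 0], [2, 2, -1, 1]]
    X = [[5, 15, 0, 7], [1, 2, 3, 4]]
    print(lut_gemv_batched(W, X))
```

In this sketch the inner loop contains no multiplications by weights, and the cost of building each table is amortized over the batch, which is the data-reuse argument the abstract makes for batching.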

Paper Structure

This paper contains 29 sections, 13 figures, 5 tables, 1 algorithm.

Figures (13)

  • Figure 1: Efficiency gain comparison between LUT-based and bit-serial computing (Neural Cache; Eckert et al., 2018) for 2-bit, 3-bit, and 4-bit quantization across various batch sizes. Dashed lines represent different bit-width quantizations.
  • Figure 2: LUT-based vector multiplication for a 4-bit input vector $[A, B, C]$ and weights $[W_0, W_1, W_2]$ using bit-serial computation with NBW = 3.
  • Figure 3: (a) Data flow of SAIL. (b) Operation of common CPU-based inference vs. SAIL.
  • Figure 4: (a) Detailed diagram of the computation flow over the proposed architecture. With a ping-pong cache, an $M \times M$ matrix is written into one half of the cache and then read by the C-SRAM. The C-SRAM performs the computation and aggregates the partial results to produce the final GEMV result. (b) Pipeline diagram. The pipeline can be kept full, without bubbles: the write, read, and computation (including aggregation) stages fully overlap.
  • Figure 5: Diagram of matrix mapping over compute SRAM. For common matrix multiplication (e.g., Q/K/V and feed-forward layers), weights in the same row are split across different C-SRAM arrays. For KV-cache-related computation (e.g., a vector multiplied by a transposed matrix built from KV-cache entries), weights in the same column are split across different C-SRAM arrays.
  • ...and 8 more figures
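
The two mappings named in the Figure 5 caption differ in whether the split cuts through the reduction dimension. The sketch below is a software analogy only. It assumes y = W x with each row of W producing one output element, and the helper names and the assignment of slices to arrays are hypothetical. Splitting rows across arrays leaves each array with a partial sum that must be aggregated, while splitting columns across arrays gives each array a disjoint block of outputs.

```python
def gemv(matrix, vec):
    """Reference matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def chunks(n, parts):
    """Split range(n) into contiguous (start, stop) slices."""
    step = -(-n // parts)  # ceiling division
    return [(s, min(s + step, n)) for s in range(0, n, step)]


def gemv_rows_split(matrix, vec, arrays=2):
    """Each row is cut into slices, so every array sees part of the reduction."""
    partials = []
    for lo, hi in chunks(len(matrix[0]), arrays):
        sub = [row[lo:hi] for row in matrix]
        partials.append(gemv(sub, vec[lo:hi]))
    # Aggregation step: element-wise sum of the per-array partial results.
    return [sum(vals) for vals in zip(*partials)]


def gemv_columns_split(matrix, vec, arrays=2):
    """Each column is cut into slices, so every array owns a block of outputs."""
    out = []
    for lo, hi in chunks(len(matrix), arrays):
        out.extend(gemv(matrix[lo:hi], vec))  # results are concatenated
    return out


if __name__ == "__main__":
    W = [[1, 2, 3, 4], [0, -1, 2, 5], [3, 3, 0, 1]]
    x = [2, 1, 0, 3]
    print(gemv(W, x), gemv_rows_split(W, x), gemv_columns_split(W, x))
```

Both partitionings return the same vector as the reference product. Only the first needs a cross-array aggregation step, like the aggregation of partial results described in the Figure 4 caption.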