SAIL: SRAM-Accelerated LLM Inference System with Lookup-Table-based GEMV

Jingyao Zhang; Jaewoo Park; Jongeun Lee; Elaheh Sadredini

TL;DR

SAIL tackles the memory-bound challenge of LLM inference on CPUs by introducing LUT-based GEMV computed in SRAM near the cache (C-SRAM), enabling arbitrary-precision operations with minimal hardware overhead. It combines tensor-level scheduling, pattern-aware LUT optimization, and in-memory type conversion, plus a dedicated ISA extension for tiled GEMV, to support quantized LLMs efficiently. Experimental results show up to 10.7x speedups over CPU baselines and up to 7.04x cost-efficiency improvements versus GPU servers, with robust performance across quantization levels and batch sizes. This work offers a practical path to high-throughput, cost-effective CPU-based LLM inference, potentially broadening access to large-scale models while minimizing data movement and energy use.

Abstract

Large Language Model (LLM) inference requires substantial computational resources, yet CPU-based inference remains essential for democratizing AI because CPUs are far more widely available than specialized accelerators. However, efficient LLM inference on CPUs faces two fundamental challenges: (1) existing CPU architectures struggle with the low-precision arithmetic required by quantized models, where the optimal bit precision varies across models and layers; and (2) the memory-bound nature of the token generation phase creates severe performance bottlenecks. To address these challenges, we propose SAIL (SRAM-Accelerated Inference of LLMs), a CPU-based inference solution that efficiently supports arbitrary bit precisions with minimal overhead. SAIL integrates three key innovations. First, we introduce Batched LUT-based General Matrix-Vector Multiplication (LUT-GEMV) with SRAM-based processing-in-memory (PIM), enabling high data reuse through lookup tables and reducing memory movement. Second, our Pattern-Aware LUT optimization identifies and exploits redundancy in input activation patterns, reducing computation cycles by 13.8%. Third, we develop an in-memory type conversion algorithm that leverages PIM's parallelism for efficient quantization and dequantization, alleviating pressure on the CPU's vector units. Our architecture requires only 2% hardware overhead and a single new instruction, while the SRAM arrays retain dual functionality as both compute and storage units. Experimental evaluations using a modified gem5 simulator demonstrate that SAIL achieves up to 10.7x speedup and 19.9x higher tokens per dollar compared to ARM Neoverse-N1 CPU baselines, and up to 7.04x better cost efficiency than NVIDIA V100 GPUs, establishing a practical path for efficient CPU-based LLM inference.
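To make the idea of a lookup-table-based GEMV concrete, the following is a minimal software sketch of one common formulation: activations are split into groups of g elements, a table of all 2^g subset sums is built per group, and each bit-plane of the quantized weights indexes into those tables. This is an illustration under stated assumptions, not SAIL's actual in-SRAM scheme; the function names (build_lut, lut_gemv, reference_gemv), the group size g, and the use of unsigned integer weights are all hypothetical choices made for the example.

```python
def build_lut(x_group):
    """Return all 2^g subset sums of one activation group.

    Bit j of the table index selects x_group[j].
    """
    g = len(x_group)
    lut = [0] * (1 << g)
    for idx in range(1, 1 << g):
        low = idx & -idx                 # lowest set bit of idx
        j = low.bit_length() - 1         # position of that bit
        lut[idx] = lut[idx ^ low] + x_group[j]
    return lut


def lut_gemv(W, x, bits, g=4):
    """Compute y = W @ x for unsigned `bits`-bit integer weights via lookups.

    Assumption: len(x) is a multiple of g and every weight fits in `bits` bits.
    """
    n = len(x)
    assert n % g == 0
    # Tables depend only on the activations, so they are built once
    # and reused by every row of W (the source of data reuse).
    luts = [build_lut(x[k:k + g]) for k in range(0, n, g)]
    y = []
    for row in W:
        acc = 0
        for b in range(bits):            # one pass per weight bit-plane
            for gi, lut in enumerate(luts):
                idx = 0
                for j in range(g):
                    idx |= ((row[gi * g + j] >> b) & 1) << j
                acc += lut[idx] << b     # weight the partial sum by 2^b
        y.append(acc)
    return y


def reference_gemv(W, x):
    """Plain multiply-accumulate GEMV, used only to check the sketch."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]
```

In this formulation the bit width only changes how many bit-planes are looped over, which is one way a lookup-based design can handle arbitrary precisions without dedicated low-precision multipliers. Because the tables depend only on the activations, the cost of building them is amortized over all weight rows.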
