T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization

Hyunwoo Oh, KyungIn Nam, Rajat Bhattacharjya, Hanning Chen, Tamoghno Das, Sanggeon Yun, Suyeon Jang, Andrew Ding, Nikil Dutt, Mohsen Imani

TL;DR

This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications, establishing a practical approach for efficient LLM inference on edge platforms.

Abstract

Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, making efficient and scalable deployment difficult. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs), which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x and 1.1-86.2x improvements in GEMM latency and GEMV throughput, respectively, with only 3.2% power and 1.4% area overheads in SIMD units. T-SAR achieves up to 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach for efficient LLM inference on edge platforms.
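
To make the idea of LUT-based ternary GEMV concrete, the following is a minimal sketch in plain Python. It is not the T-SAR kernel: the group size `GROUP`, the base-3 index encoding, and all function names are assumptions chosen for illustration. It only shows the general principle that, for ternary weights, a small table of signed activation sums can be built on the fly from the activations, after which each output needs only lookups and additions. A real kernel would hold such tables in SIMD registers rather than Python lists.

```python
from itertools import product

GROUP = 2  # activations per lookup table (hypothetical choice for illustration)


def encode_group(ws):
    """Map a group of ternary weights (-1, 0, +1) to a base-3 table index."""
    idx = 0
    for w in ws:
        idx = idx * 3 + (w + 1)
    return idx


def build_lut(acts):
    """Return all signed sums of one activation group, ordered to match encode_group."""
    return [sum(w * a for w, a in zip(ws, acts))
            for ws in product((-1, 0, 1), repeat=len(acts))]


def pack_weights(W):
    """Offline step: replace each group of ternary weights with its table index."""
    return [[encode_group(row[i:i + GROUP]) for i in range(0, len(row), GROUP)]
            for row in W]


def lut_gemv(packed, x):
    """Compute y = W @ x using per-group tables built from x, then lookups and adds."""
    luts = [build_lut(x[i:i + GROUP]) for i in range(0, len(x), GROUP)]
    return [sum(lut[idx] for lut, idx in zip(luts, row)) for row in packed]


def reference_gemv(W, x):
    """Plain multiply-accumulate GEMV, used only to check the LUT version."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


W = [[1, 0, -1, 1],
     [-1, -1, 0, 1]]
x = [3, -2, 5, 7]
y = lut_gemv(pack_weights(W), x)
```

In this sketch the tables depend only on the activations, so they are rebuilt once per input vector and shared across all weight rows, while the weights are packed into indices once, ahead of time.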

Paper Structure

This paper contains 15 sections, 1 equation, 10 figures, and 3 tables.

Figures (10)

  • Figure 1: Motivation for scalable ternary LLM acceleration. (a) Ternary LLMs provide 8$\times$ size reduction with minimal accuracy loss, making them suitable for edge deployment. (b) GEMM/GEMV dataflow: SOTA LUT-based kernels store TLUTs in DRAM, causing frequent memory access requests. (c) Memory access breakdown: TLUTs dominate system memory requests (over 75%) across models from 125M to 100B parameters, creating a major bottleneck for CPU inference.
  • Figure 2: Ternary LLMs: Architecture and bottleneck analysis. (a) Ternary transformer with BitLinear layers. (b) BitLinear layer workflow including quantization and LUTGEMM. (c) BitNet-b1.58-2B-4T memory footprints: TLUTs, though tiny in RAM, dominate memory accesses. (d) BitLinear GEMV time breakdown: Memory R/W dominates execution.
  • Figure 3: Prior LUT-based CPU solution vs. T-SAR. (a) Prior: precomputed LUTs loaded from DRAM. (b) T-SAR: on-the-fly compressed LUTs generated in SIMD registers.
  • Figure 4: Proposed LUT GEMV Algorithm for LUT compression, matching the LUT size to the data-path.
  • Figure 5: T-SAR's LUT-based kernel framework overview.
  • ...and 5 more figures