CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs

Gunho Park, Jeongin Bae, Byeongwook Kim, Baeseong Park, Jiwon Ryu, Hoseung Kim, Se Jung Kwon, Dongsoo Lee

TL;DR

CodeGEMM is a codebook-centric GEMM kernel for codebook-based quantized LLMs. It replaces dequantization with a Psumbook of precomputed centroid-activation inner products. This design reduces on-chip cache pressure and computational redundancy, enabling faster 2-bit inference on large models such as Llama-3.1 (up to 8.93x on the 70B model, relative to state-of-the-art codebook-based quantization) while preserving accuracy. Tunable hyperparameters allow exploring latency-memory-accuracy trade-offs. The work demonstrates scalability to large models and notes practical limitations related to on-chip memory and batch size.

Abstract

Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods extend this trend by achieving strong accuracy in the extremely low-bit regime (e.g., 2-bit). However, current kernels rely on dequantization, which repeatedly fetches centroids and reconstructs weights, incurring substantial latency and cache pressure. We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports the systematic exploration of latency-memory-accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy and further improves computing efficiency and memory subsystem utilization.
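
The difference between dequantization and the Psumbook lookup described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' kernel: the function names, the toy sizes (a 4 x 32 weight matrix, one codebook of 4 centroids of length 8), and the omission of per-group scales and multiple codebooks are all assumptions made for illustration.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))

def dequant_gemv(codes, codebook, x, v):
    """Baseline: fetch the centroid for every code, then multiply it with the activations."""
    out = []
    for row in codes:                      # one code index per length-v chunk of the row
        acc = 0.0
        for j, idx in enumerate(row):
            centroid = codebook[idx]       # the same centroid is fetched again and again
            acc += dot(centroid, x[j * v:(j + 1) * v])
        out.append(acc)
    return out

def build_psumbook(codebook, x, v):
    """Precompute psumbook[j][c] = <codebook[c], j-th length-v chunk of x>."""
    n_chunks = len(x) // v
    return [[dot(centroid, x[j * v:(j + 1) * v]) for centroid in codebook]
            for j in range(n_chunks)]

def psumbook_gemv(codes, psumbook):
    """Gather precomputed partial sums by code index and accumulate them."""
    return [sum(psumbook[j][idx] for j, idx in enumerate(row)) for row in codes]

# Hypothetical toy problem (sizes chosen only for illustration).
rng = random.Random(0)
v, n_centroids, rows, cols = 8, 4, 4, 32
codebook = [[rng.uniform(-1, 1) for _ in range(v)] for _ in range(n_centroids)]
codes = [[rng.randrange(n_centroids) for _ in range(cols // v)] for _ in range(rows)]
x = [rng.uniform(-1, 1) for _ in range(cols)]

y_ref = dequant_gemv(codes, codebook, x, v)
psumbook = build_psumbook(codebook, x, v)   # built once, shared by every output row
y_psum = psumbook_gemv(codes, psumbook)
```

In this sketch the Psumbook is built once per activation vector and reused by every output row, so each code costs one lookup and one addition instead of a length-`v` multiply-accumulate. The cost of building the table is amortized only when the weight matrix has many rows, which is consistent with the abstract's emphasis on large models.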

Paper Structure

This paper contains 29 sections, 3 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Comparison of matrix multiplication kernels for codebook-based quantized models. The dequantization-based kernel performs on-the-fly dequantization, requiring the entire codebook to be loaded into cache. In contrast, CodeGEMM precomputes partial sums and stores them in a Psumbook, eliminating dequantization overhead and redundant computation.
  • Figure 2: Illustration of the quantization process for a $(4 \times 32)$ weight matrix with $b = 2$, $m = 1$, $v = 8$, and $g = 16$.
  • Figure 3: Overview of the CodeGEMM kernel operation for codebook-based quantized models. 1) Input data is reshaped into vectors to align with the codebook dimensions. 2) Precomputed inner products between the codebook and input vectors are stored in the Psumbook within the programmable cache, significantly reducing computational overhead. 3) During computation, codes query the corresponding partial sums from the Psumbook, which are then accumulated to generate the output efficiently without requiring on-the-fly dequantization.
  • Figure 4: Latency and accuracy trade-offs for the Llama-3.1-8B model under various configurations.
  • Figure 5: Latency and accuracy trade-offs for the Llama-3.1-8B model under various configurations.