Fast Matrix Multiplications for Lookup Table-Quantized LLMs

Han Guo; William Brandon; Radostin Cholakov; Jonathan Ragan-Kelley; Eric P. Xing; Yoon Kim

TL;DR

This work tackles memory bandwidth as the primary bottleneck in token-by-token LLM inference by leveraging weight-only LUT quantization and fused dequantization-matmul kernels. It introduces FLUTE, a fused kernel built around offline weight restructuring, vectorized shared-memory LUTs, and Stream-K workload partitioning to support mixed-type matrix multiplications with non-uniform LUT quantization, including 3-bit weights. The approach delivers substantial gains, achieving 2–4x speedups over dense FP16 GEMMs at small batch sizes for 4-bit LUTs and 1.5–2x end-to-end throughput improvements on LLaMA3 with learned NF quantization, across multiple configurations and frameworks. These results demonstrate practical performance improvements and highlight hardware-design considerations for future accelerators to better support mixed-type, LUT-based quantization in LLM inference.
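
As a rough illustration of why weight-only LUT quantization reduces memory movement, the sketch below gives an unfused, unoptimized Python reference for what a fused dequantization-matmul computes, plus a back-of-the-envelope weight footprint. The function names, the 2-bit table values, and the 16-bit scale per group are assumptions made for illustration; this is not FLUTE's kernel.

```python
def lut_matvec(x, q_idx, lut, scales, group_size):
    """Reference semantics: y[n] = sum_k x[k] * lut[q_idx[n][k]] * scale(n, group of k)."""
    out = []
    for n, row in enumerate(q_idx):
        acc = 0.0
        for k, idx in enumerate(row):
            # Dequantize on the fly: table lookup, then per-group scaling.
            w = lut[idx] * scales[n][k // group_size]
            acc += x[k] * w
        out.append(acc)
    return out


def weight_bytes(n, k, bits, group_size=None, scale_bits=16):
    """Bytes to store an n x k weight matrix, plus per-group scales if grouped.

    The lookup table itself is ignored because it is tiny by comparison.
    """
    total_bits = n * k * bits
    if group_size is not None:
        total_bits += (n * k // group_size) * scale_bits
    return total_bits // 8


# Hypothetical 2-bit table (4 entries), kept small so it can be checked by hand.
lut = [-1.0, -0.25, 0.25, 1.0]
q_idx = [[0, 3, 1, 2], [2, 2, 0, 1]]   # stored indices, not weights
scales = [[2.0, 0.5], [1.0, 4.0]]      # one scale per group of 2 weights
x = [1.0, 2.0, 3.0, 4.0]
y = lut_matvec(x, q_idx, lut, scales, group_size=2)

fp16_bytes = weight_bytes(8192, 8192, bits=16)
w4g128_bytes = weight_bytes(8192, 8192, bits=4, group_size=128)
```

For an 8192 x 8192 matrix, the 4-bit, group-size-128 footprint under these assumptions is a little under 4x smaller than the 16-bit one. That is only a ratio of bytes stored, not a measured speedup.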

Abstract

The deployment of large language models (LLMs) is often constrained by memory bandwidth, where the primary bottleneck is the cost of transferring model parameters from the GPU's global memory to its registers. When coupled with custom kernels that fuse the dequantization and matmul operations, weight-only quantization can thus enable faster inference by reducing the amount of memory movement. However, developing high-performance kernels for weight-quantized LLMs presents substantial challenges, especially when the weights are compressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform, lookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup table engine for LUT-quantized LLMs, which uses offline restructuring of the quantized weight matrix to minimize bit manipulations associated with unpacking, and vectorization and duplication of the lookup table to mitigate shared memory bandwidth constraints. At batch sizes < 32 and quantization group size of 128 (typical in LLM inference), the FLUTE kernel can be 2-4x faster than existing GEMM kernels. As an application of FLUTE, we explore a simple extension to lookup table-based NormalFloat quantization and apply it to quantize LLaMA3 to various configurations, obtaining competitive quantization performance against strong baselines while obtaining an end-to-end throughput increase of 1.5 to 2 times.
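
To see why bit widths such as 3 are awkward to unpack, the following Python sketch packs small integers densely and reports which values end up split across a byte boundary. The helper names and the least-significant-bit-first layout are assumptions for illustration; the sketch does not reproduce FLUTE's offline restructuring, it only shows the problem that restructuring is meant to reduce.

```python
def pack_bits(values, bits):
    """Pack unsigned integers of width `bits` into bytes, least-significant bit first."""
    acc = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << bits):
            raise ValueError(f"value {v} does not fit in {bits} bits")
        acc |= v << (i * bits)
    n_bytes = (len(values) * bits + 7) // 8
    return acc.to_bytes(n_bytes, "little")


def unpack_bits(data, bits, count):
    """Inverse of pack_bits: recover `count` values of width `bits`."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]


def straddling(bits, count):
    """Positions whose first and last bit fall in different bytes."""
    return [
        i for i in range(count)
        if (i * bits) // 8 != (i * bits + bits - 1) // 8
    ]


values = [5, 0, 7, 3, 1, 6, 2, 4]      # eight 3-bit indices
packed = pack_bits(values, bits=3)     # 24 bits, so 3 bytes
restored = unpack_bits(packed, bits=3, count=len(values))

split_3bit = straddling(bits=3, count=8)
split_4bit = straddling(bits=4, count=8)
```

With 4-bit values every index sits inside a single byte, whereas with 3-bit values some indices are split across two bytes and need extra shifts and masks to reassemble.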

Paper Structure (20 sections, 2 equations, 8 figures, 4 tables, 2 algorithms)

Introduction
Background and Related Work
GPU Architecture and Memory Bandwidth Bottlenecks
LLM Deployment Characteristics
Weight-only Quantization in LLMs
FLUTE: A Fast and Flexible Kernel for Mixed-Type Matrix Multiplications
Offline Matrix Restructuring
Vectorized Lookup in Shared Memory
Reducing bank conflicts.
Stream-K Workload Partitioning
Mixed precision accumulation and global reduction.
Experiments
Kernel Benchmarks
LUT quantization method.
Results.
...and 5 more sections

Figures (8)

Figure 1: A simplified view of a kernel that fuses the dequantization and matmul steps. Each threadblock (group of threads) is responsible for computing one or more output tiles by performing the matrix product between specific rows of inputs and columns of weights. (1) The threadblock issues asynchronous copy instructions to fetch small chunks of input data (tiles) from global memory to shared memory. (2) As soon as a tile arrives in shared memory, it is further sliced into smaller chunks (fragments) and copied into registers. (3) Once all necessary components are in the registers, the quantized matrix undergoes dequantization. (4) The dequantized matrix and inputs are then processed by Tensor Cores using MMA (Matrix Multiply Accumulate) instructions. (5) Finally, the accumulated results are written back from the registers to the outputs in global memory.
Figure 2: Vectorized Lookup Table Design (Left). Instead of dequantizing one element at a time, we vectorize the lookup table by creating another table that holds the values of all possible pairs of indices. This can look up two values simultaneously, followed by efficient vectorized scaling operations. Stream-K Work Decomposition (Right). In classic work decomposition, output tile production is independently assigned to threadblocks. Each threadblock processes one (or more) rows of the left operand and one (or more) columns of the right operand, slicing down the inner K dimension to compute the corresponding output tile (Slice-K). However, when the weight matrix is heavily quantized, the reduced size can lead to "stragglers" in Slice-K due to uneven workload assignment. Stream-K (Osama et al., 2023) addresses this by decomposing work at a finer granularity, enabling multiple threadblocks to collaboratively compute a single output tile.
Figure 3: Runtime performance of FLUTE in the standard W4G128 setting, where the weights are quantized to 4 bits in groups of 128. We show speedup against 16-bit torch.mm. The matrix shapes for our benchmarks are selected based on those used in Llama-3-8B (top row) and Llama-3-70B (bottom row) models. For each M-N-K shape tuple, we generate three random sets of data, run each kernel on the data 100 times, and average. While our main comparisons are against other LUT kernels (bitsandbytes, BitBLAS-NF4), for reference we also include comparisons with kernels that only support uniform (integer) dequantization (Marlin, BitBLAS). These results are represented with dashed lines in our figures.
Figure 4: Runtime performance at various bit-widths and group sizes with N=K=8192. FLUTE consistently achieves speedups across different settings, including in the 3-bit configuration.
Figure 5: End-to-end latency benchmark for processing a single batch of requests using vLLM. We evaluated LLaMA-3 (8B and 70B) and Gemma-2 (9B and 27B) models with various configurations, including different bits, model sizes, number of GPUs, input lengths, output lengths, and batch sizes. The models were quantized using a group size of 64 to achieve a good balance between quality and speed.
...and 3 more figures
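
The vectorized lookup described in the Figure 2 caption (left) can be sketched in Python as follows. The sketch assumes 4-bit indices packed two per byte with the low nibble first, and a hypothetical 16-entry table with made-up values; the real index layout and the placement of the table in shared memory are not specified here.

```python
def build_pair_lut(lut, bits):
    """Table over all pairs of indices: entry p holds (lut[low index], lut[high index])."""
    mask = (1 << bits) - 1
    return [(lut[p & mask], lut[p >> bits]) for p in range(1 << (2 * bits))]


def dequant_scalar(packed, lut, bits):
    """Baseline: two separate lookups per packed byte."""
    mask = (1 << bits) - 1
    out = []
    for byte in packed:
        out.append(lut[byte & mask])
        out.append(lut[(byte >> bits) & mask])
    return out


def dequant_paired(packed, pair_lut):
    """Vectorized variant: one lookup per byte returns both values."""
    out = []
    for byte in packed:
        out.extend(pair_lut[byte])
    return out


# Hypothetical 16-entry table; the values are illustrative only.
lut = [(i - 8) / 8 for i in range(16)]
pair_lut = build_pair_lut(lut, bits=4)

packed = bytes([0x00, 0x1F, 0xA5, 0xFF])
scalar_out = dequant_scalar(packed, lut, bits=4)
paired_out = dequant_paired(packed, pair_lut)
```

The trade-off is table size: the pair table has 2^(2b) entries for b-bit indices (256 entries for 4 bits, 64 for 3 bits), in exchange for halving the number of lookups.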
