QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

TL;DR

This work presents QUIK, a hybrid 4-bit (INT4) quantization method to enable end-to-end inference for large language models by quantizing both weights and activations while keeping a small set of outliers in higher precision. The authors design GPU-optimized kernels that fuse quantization and dequantization into matmul, achieving up to 3.4x end-to-end speedups and substantial memory reductions across OPT, LLaMA-2, and Falcon, with a small perplexity increase (often 0.3–0.5). They demonstrate that most linear computations can operate in 4-bit precision, while selective outliers and 8-bit down-projections preserve accuracy in challenging layers (notably LLaMA-2 Down-Proj). The approach includes calibration-based outlier selection, weight clipping, and a 2:4 sparsity extension for very large models, providing a practical path toward efficient local or edge deployment of GPT-family models. Overall, QUIK closes the gap between hardware-supported low-precision primitives and quantization algorithms, enabling sizable real-world improvements in throughput and memory footprint for large-scale generative models.

Abstract

Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address them in compute-bound scenarios, such as batched inference or prompt processing. In this paper, we address the general quantization problem, where both weights and activations should be quantized. We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit, while keeping some outlier weights and activations in higher precision. The key feature of our scheme is that it is designed with computational efficiency in mind: we provide GPU kernels matching the QUIK format with highly efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x relative to FP16 execution. We provide detailed studies for models from the OPT, LLaMA-2 and Falcon families, as well as a first instance of accurate inference using quantization plus 2:4 sparsity. Code is available at: https://github.com/IST-DASLab/QUIK.
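
The memory-bound versus compute-bound distinction drawn in the abstract can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not from the paper: it assumes a single 8K x 8K FP32 MatMul, counts only the ideal traffic for inputs, weights, and outputs, and uses hypothetical peak-throughput and bandwidth figures (`PEAK_FLOPS`, `PEAK_BANDWIDTH`) chosen purely for illustration.

```python
def matmul_arithmetic_intensity(tokens, d_in, d_out, bytes_per_element):
    """FLOPs per byte for X (tokens x d_in) times W^T (d_in x d_out)."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per term
    # Ideal traffic: read X, read W, write the output once each.
    elements_moved = tokens * d_in + d_in * d_out + tokens * d_out
    return flops / (bytes_per_element * elements_moved)


# Hypothetical hardware numbers, for illustration only (not from the paper).
PEAK_FLOPS = 35e12        # FLOP/s
PEAK_BANDWIDTH = 0.9e12   # bytes/s
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs/byte where the two limits meet


def is_compute_bound(tokens, d=8192, bytes_per_element=4):
    """True when the estimated intensity is at or above the ridge point."""
    return matmul_arithmetic_intensity(tokens, d, d, bytes_per_element) >= RIDGE_POINT


regimes = {t: is_compute_bound(t) for t in (1, 16, 128, 1024)}
for t, compute_bound in regimes.items():
    label = "compute-bound" if compute_bound else "memory-bound"
    print(f"{t:5d} tokens: {label}")
```

Under these assumed numbers, single-token generation sits far below the ridge point, so shrinking the weights alone helps there, while large token counts are limited by arithmetic throughput, which is the regime where quantizing activations as well can pay off.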

Paper Structure (52 sections, 1 equation, 14 figures, 14 tables, 1 algorithm)

Figures (14)

  • Figure 1: Accuracy and speedups for QUIK at different model sizes, on the LLaMA family of models. QUIK achieves up to 3.4x speedup with minor accuracy degradation on LLaMA-2 models.
  • Figure 2: Roofline analysis of a standard LLM MatMul operation, for a matrix of size 8K x 8K, in FP32, on an NVIDIA GPU. Markers denote the results of profiling with different token counts (from 1 to 1024). Small counts (1 and 16) are memory-bound, whereas larger counts (from 128 to 1024) are compute-bound.
  • Figure 3: Ideal matrix multiplication performance for different layer sizes and data precision on RTX3090.
  • Figure 4: Outlier-aware quantization with QUIK. Outlier weight columns are extracted based on outlier columns in the input. We permute the outlier columns toward the end of the matrix before applying GPTQ quantization (using the re-ordered Hessian matrix) to accumulate the quantization errors in the FP16 columns.
  • Figure 5: Schematic for the forward pass of a linear layer ($XW^T$) with QUIK-4B. First, the input outlier features are extracted based on pre-defined indices, and the remaining input values are quantized using per-token quantization. The INT4 MatMul is then applied using the quantized weights, which are computed offline (see Figure 4). Finally, the output is dequantized, cast to FP16, and added to the result of the FP16 MatMul.
  • ...and 9 more figures
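
The forward pass described in the Figure 5 caption can be sketched in NumPy. This is a simplified illustration, not the authors' kernels: the function names `symmetric_quantize` and `quik_like_linear` are hypothetical, plain round-to-nearest symmetric quantization stands in for GPTQ and for the paper's actual quantizers, integers are held in `int32` instead of packed INT4, and `float32` stands in for FP16.

```python
import numpy as np


def symmetric_quantize(a, axis, bits=4):
    """Round `a` to signed integers with one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = np.abs(a).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)      # avoid division by zero
    q = np.clip(np.round(a / scale), -qmax, qmax).astype(np.int32)
    return q, scale


def quik_like_linear(x, w, outlier_idx, bits=4):
    """Approximate x @ w.T, keeping the chosen input columns in full precision.

    x: (tokens, in_features), w: (out_features, in_features).
    """
    mask = np.zeros(x.shape[1], dtype=bool)
    mask[np.asarray(outlier_idx, dtype=np.intp)] = True

    # Split inputs and weights into outlier and base (quantized) columns.
    x_out, x_base = x[:, mask], x[:, ~mask]
    w_out, w_base = w[:, mask], w[:, ~mask]

    # Weights: one scale per output row (done offline in practice).
    wq, w_scale = symmetric_quantize(w_base, axis=1, bits=bits)
    # Activations: one scale per token (done at run time).
    xq, x_scale = symmetric_quantize(x_base, axis=1, bits=bits)

    # Integer MatMul, then dequantize with the outer product of the scales.
    acc = xq @ wq.T
    y_base = acc.astype(np.float32) * x_scale * w_scale.T

    # Full-precision MatMul on the outlier columns, added to the result.
    y_out = x_out.astype(np.float32) @ w_out.astype(np.float32).T
    return y_base + y_out


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))
x[:, [3, 17]] *= 50.0                               # synthetic outlier features
w = rng.standard_normal((32, 64))
ref = x @ w.T

y_with = quik_like_linear(x, w, outlier_idx=[3, 17])
y_without = quik_like_linear(x, w, outlier_idx=[])
err_with = np.abs(y_with - ref).mean()
err_without = np.abs(y_without - ref).mean()
print(f"mean abs error with outliers kept: {err_with:.3f}")
print(f"mean abs error without:            {err_without:.3f}")
```

On this synthetic input, routing the two large-magnitude columns around the quantizer keeps the per-token scales small, so the remaining features are not all rounded toward zero.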