A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms

Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu

TL;DR

This survey reviews low-bit quantization for large language models as a response to their prohibitive memory and compute demands. It integrates three perspectives: basics (formats, granularity, dynamic/static quantization), systems (frameworks, hardware, and KV-cache handling), and algorithms (QAT/PTQ, training and PEFT, and advanced transformations). The work catalogs representative data formats, quantization strategies, and toolchains, and discusses practical deployment across diverse hardware, with a view toward future directions such as KV-cache compression and hardware-aware co-design. Overall, it provides a roadmap for making large-scale LLMs more efficient and more broadly deployable through low-bit quantization.

Abstract

Large language models (LLMs) have achieved remarkable advancements in natural language processing, showcasing exceptional performance across various tasks. However, the expensive memory and computational requirements present significant challenges for their practical deployment. Low-bit quantization has emerged as a critical approach to mitigate these challenges by reducing the bit-width of model parameters, activations, and gradients, thus decreasing memory usage and computational demands. This paper presents a comprehensive survey of low-bit quantization methods tailored for LLMs, covering the fundamental principles, system implementations, and algorithmic strategies. An overview of basic concepts and new data formats specific to low-bit LLMs is first introduced, followed by a review of frameworks and systems that facilitate low-bit LLMs across various hardware platforms. Then, we categorize and analyze techniques and toolkits for efficient low-bit training and inference of LLMs. Finally, we conclude with a discussion of future trends and potential advancements of low-bit LLMs. Our systematic overview from basic, system, and algorithm perspectives can offer valuable insights and guidelines for future works to enhance the efficiency and applicability of LLMs through low-bit quantization.
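
As a rough illustration of the memory argument in the abstract, the sketch below estimates weight storage at several bit-widths. The 7-billion-parameter model size and the helper name `weight_memory_gib` are assumptions chosen for illustration and do not come from the survey. The estimate covers weights only and ignores quantization metadata (scales, zero-points), activations, and the KV cache.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GiB.

    Hypothetical helper: counts only the raw weight bits and ignores
    per-group scales, zero-points, activations, and the KV cache.
    """
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed model size for illustration (not a figure from the survey).
NUM_PARAMS = 7_000_000_000

for bits in (16, 8, 4, 2):
    gib = weight_memory_gib(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gib:6.2f} GiB")
```

Under these assumptions, weight storage scales linearly with bit-width, so moving from 16-bit to 4-bit weights cuts it by a factor of four before any metadata overhead is counted.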

Paper Structure

This paper contains 51 sections, 34 equations, 11 figures, 5 tables, 3 algorithms.

Figures (11)

  • Figure 1: The skeleton of LLM quantization methods. The diagram illustrates the main areas covered in the survey.
  • Figure 2: Illustrations for different quantization granularity.
  • Figure 3: Dynamic and static quantization. Operations inside the green block belong to the inference process; operations outside it belong to the production and preparation process.
  • Figure 4: Data transmission of weights and activations in the caching system during inference. Bandwidth and latency are the values officially reported for the NVIDIA A100, used as an example. PCIe is a high-speed interface standard for connecting hardware components such as GPUs and SSDs. Async_Copy denotes asynchronous data copy using the cp.async intrinsic. ldmatrix and lds are data-loading instructions that load matrices from shared memory to registers, the former with a strict layout requirement and the latter in a fine-grained and flexible manner (NVIDIA_PTX_ISA).
  • Figure 5: The data transmission process of quantization for (a) Quantized weight preparation (weight pack), (b) Weight-only quantization, and (c) Weight & Activation quantization.
  • ...and 6 more figures