LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

Ruihao Gong, Yang Yong, Shiqiao Gu, Yushi Huang, Chengtao Lv, Yunchen Zhang, Xianglong Liu, Dacheng Tao

TL;DR

LLMC is a plug-and-play compression toolkit for fairly and systematically exploring the impact of quantization. It is highly extensible: from integer to floating-point quantization, from LLMs to vision-language models (VLMs), from fixed-bit to mixed precision, and from quantization to sparsification.

Abstract

Recent advancements in large language models (LLMs) are propelling us toward artificial general intelligence with their remarkable emergent abilities and reasoning capabilities. However, their substantial computational and memory requirements limit widespread adoption. Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating LLMs, albeit with potential risks to accuracy. Numerous studies have aimed to minimize the accuracy loss associated with quantization. However, their quantization configurations differ from one another, so the methods cannot be fairly compared. In this paper, we present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. LLMC integrates dozens of algorithms, models, and hardware backends, offering high extensibility from integer to floating-point quantization, from LLMs to vision-language models (VLMs), from fixed-bit to mixed precision, and from quantization to sparsification. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats, providing novel insights and detailed analyses for further research and practical guidance for users. Our toolkit is available at https://github.com/ModelTC/llmc.

Paper Structure (21 sections, 1 equation, 9 figures, 34 tables)

Figures (9)

  • Figure 1: Overview of our LLM compression toolkit LLMC, which incorporates diverse algorithms, ultra-low-cost quantization, multi-backend support, and high extensibility. More features are under development.
  • Figure 2: Token distribution for calibration/test datasets. The y-axis shows frequency, the x-axis shows token ID, and "$\mathcal{D}_{KL}$" denotes the KL divergence between the calibration data and the test data (WikiText2).
  • Figure 3: Kurtosis value of weights (left) and input activations (right) for various layer types and different methods under w6a6 quantization. The legends denote the quantization method and its corresponding PPL on WikiText2. For a fair comparison, we do not apply the transformation to down_proj, as only default AWQ and QuaRot include this position. The colored values represent the changes in $K$ after applying the transformation to down_proj for all scaling-based methods, and the online transformation for QuaRot. Note that we only mark values $>0.2$ in all cases.
  • Figure 4: Comparison between asymmetric and symmetric weight clipping w.r.t. asymmetric/symmetric quantization. After weight clipping, we obtain the final range of the tensor to quantize, as depicted in the solid gray box for asymmetric/symmetric quantization.
  • Figure 5: Visualization of relative quantization errors for the weight of q_proj in the first block for w3a16g128 LLaMA-3-8B. $\widehat{\boldsymbol{W}}$ represents the quantized counterpart of the weight $\boldsymbol{W}$.
  • ...and 4 more figures
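
The captions for Figures 2 and 3 rely on two statistics, the KL divergence between token distributions and the kurtosis $K$ of a tensor. The following is a minimal sketch of how such quantities could be computed, not the toolkit's implementation. The function names, the additive smoothing constant `eps`, the direction of the divergence (calibration relative to test), and the use of non-excess kurtosis are all assumptions made for illustration.

```python
import math
from collections import Counter


def token_distribution(token_ids, vocab_size, eps=1e-8):
    """Smoothed relative frequency of each token ID (hypothetical helper).

    eps is an assumed additive smoothing term so that unseen tokens
    do not produce a division by zero in the KL divergence.
    """
    counts = Counter(token_ids)
    total = len(token_ids) + eps * vocab_size
    return [(counts.get(t, 0) + eps) / total for t in range(vocab_size)]


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kurtosis(values):
    """Non-excess kurtosis: mean fourth power of the standardized values.

    Assumed definition; heavier tails (more outliers) give larger values.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 4 for v in values) / (n * var ** 2)


# Toy usage with made-up token IDs and weights (not data from the paper).
calib = token_distribution([0, 0, 1, 2], vocab_size=4)
test = token_distribution([0, 1, 1, 2], vocab_size=4)
print("D_KL(calib || test):", kl_divergence(calib, test))
print("kurtosis without outlier:", kurtosis([-1.0, 1.0, -1.0, 1.0]))
print("kurtosis with outlier:", kurtosis([0.0, 0.0, 0.0, 0.0, 10.0]))
```

Under these assumptions, a smaller divergence indicates calibration data whose token statistics are closer to the test set, and a larger kurtosis indicates a more outlier-heavy tensor, which is generally harder to quantize.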