FPTQ: Fine-grained Post-Training Quantization for Large Language Models

Qingyuan Li, Yifan Zhang, Liang Li, Peng Yao, Bo Zhang, Xiangxiang Chu, Yerui Sun, Li Du, Yuchen Xie

TL;DR

The paper tackles the deployment bottleneck of large language models by introducing Fine-grained Post-Training Quantization (FPTQ), a practical W4A8 (4-bit weights, 8-bit activations) PTQ method that preserves accuracy without fine-tuning. It combines layer-wise activation quantization strategies with fine-grained weight quantization and a novel Logarithmic Activation Equalization to stabilize activations and reduce quantization error. Across BLOOM, LLaMA, and LLaMA-2, FPTQ achieves near-FP16 accuracy on standard benchmarks such as LAMBADA, MMLU, and Common Sense QA, matching or outperforming existing PTQ methods while remaining hardware-efficient. The work also demonstrates data-free viability and analyzes the interaction with alternative weight-refinement techniques (e.g., GPTQ). Overall, FPTQ offers a low-cost, high-accuracy path to deploying open LLMs, requiring only modest calibration and no fine-tuning.

Abstract

In the era of large-scale language models, the substantial parameter size poses significant challenges for deployment. As a prevalent compression technique, quantization has become the mainstream way to tackle this issue, and current practice centers on two recipes: W8A8 and W4A16 (i.e., weights and activations at those bit widths). In this study, we propose a novel W4A8 post-training quantization method for the available open-source LLMs that combines the advantages of both recipes: the I/O benefit of 4-bit weight quantization and the acceleration from 8-bit matrix computation. Nevertheless, W4A8 is known to suffer from severe performance degradation. As a remedy, we employ layer-wise activation quantization strategies, which feature a novel logarithmic equalization for the most intractable layers, and we combine them with fine-grained weight quantization. Without bells and whistles, we eliminate the need for further fine-tuning and obtain state-of-the-art W4A8 quantized performance on BLOOM, LLaMA, and LLaMA-2 on standard benchmarks. We confirm that W4A8 quantization is achievable for the deployment of large language models, fostering their widespread real-world application.
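To make the W4A8 recipe and the idea of fine-grained weight quantization concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: it assumes symmetric uniform quantization, and the function names, the group size of 4, and the sample values are all hypothetical. It compares one shared scale per weight row with one scale per contiguous group, and reuses the same quantizer at 8 bits for a single activation token.

```python
def quantize_symmetric(values, bits):
    """Quantize floats to signed integers that share one scale (assumed symmetric)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantize_grouped(row, group_size, bits=4):
    """Split one weight row into contiguous groups, each with its own scale."""
    return [
        quantize_symmetric(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_grouped(groups):
    """Map grouped integers back to floats using each group's scale."""
    return [q * scale for q_values, scale in groups for q in q_values]


def mean_abs_error(row, groups):
    """Average absolute reconstruction error of a quantized row."""
    restored = dequantize_grouped(groups)
    return sum(abs(a - b) for a, b in zip(row, restored)) / len(row)


if __name__ == "__main__":
    # Hypothetical weight row: four small values followed by four large ones.
    row = [0.01, -0.02, 0.03, -0.04, 4.0, -3.0, 2.0, -1.0]
    coarse = quantize_grouped(row, group_size=len(row))  # one scale per row
    fine = quantize_grouped(row, group_size=4)           # one scale per group
    print("one scale per row  :", mean_abs_error(row, coarse))
    print("one scale per group:", mean_abs_error(row, fine))

    # 8-bit activation quantization of a single (hypothetical) token.
    token_q, token_scale = quantize_symmetric([0.5, -2.0, 1.0], bits=8)
    print("8-bit token:", token_q, "scale:", token_scale)
```

In this example the four small weights all round to zero when the whole row shares a single 4-bit scale, because the scale is set by the largest weight. With one scale per group they keep distinct quantized values, which is the motivation for the finer granularity.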

Paper Structure (25 sections, 4 equations, 10 figures, 7 tables, 1 algorithm)

Figures (10)

  • Figure 1: Activation distribution before and after logarithmic equalization on BLOOM-7B1.
  • Figure 2: (a) Two stages of LLM inference, where context decoding is compute-bound and self-decoding is memory-bound. (b) W4A8 speeds up both stages and is faster than the other two recipes (W8A8 and W4A16).
  • Figure 3: Visualization of activation distribution of $o_{proj}$ and $down_{proj}$ on LLaMA-7B.
  • Figure 4: (a) Per-channel weight quantization. (b) Fine-grained per-channel quantization. (c, d) Self-attention and FFN in most LLMs. Light blue: per-tensor static activation quantization. Purple: per-token dynamic activation quantization. All weights are quantized in a fine-grained manner.
  • Figure 5: Visualization of activation distribution of $o_{proj}$ and $down_{proj}$ on LLaMA-2-7B.
  • ...and 5 more figures
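
The Figure 4 caption distinguishes per-tensor static from per-token dynamic activation quantization. The sketch below illustrates the general difference between the two modes; it is not taken from the paper. It assumes symmetric 8-bit quantization, and every function name and sample value is hypothetical. The static mode fixes one scale offline from calibration samples, while the dynamic mode derives a fresh scale from each token at run time.

```python
def _clamp_round(value, scale, qmax):
    """Round to the nearest integer step and clamp to the signed range."""
    return max(-qmax, min(qmax, round(value / scale)))


def calibrate_static_scale(calibration_tokens, bits=8):
    """Offline step: one scale for the whole tensor, from calibration samples."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for token in calibration_tokens for v in token)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_static(token, scale, bits=8):
    """Per-tensor static: reuse the precomputed scale for every token."""
    qmax = 2 ** (bits - 1) - 1
    return [_clamp_round(v, scale, qmax) for v in token]


def quantize_dynamic(token, bits=8):
    """Per-token dynamic: compute the scale from the token itself at run time."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in token)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [_clamp_round(v, scale, qmax) for v in token], scale


if __name__ == "__main__":
    # Hypothetical calibration set containing one large-magnitude token.
    calibration = [[0.1, -0.2], [8.0, 1.0]]
    static_scale = calibrate_static_scale(calibration)

    small_token = [0.1, -0.2]
    print("static :", quantize_static(small_token, static_scale))
    print("dynamic:", quantize_dynamic(small_token)[0])
```

The trade-off this shows is the usual one: a static scale costs nothing at inference time but wastes integer range on tokens much smaller than the calibration maximum, whereas a dynamic scale uses the full range for every token at the price of an extra reduction per token.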