What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation

Zhuocheng Gong, Jiahao Liu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan

TL;DR

This work reframes LLM quantization as perturbations added to weights and activations, introducing the lens of perturbation to analyze how different perturbation properties affect performance under zero-shot uniform quantization. By comparing perturbation distributions and magnitudes, the study reveals that perturbation magnitude drives degradation more than distribution, with outliers and clipping amplifying adverse effects; larger values exhibit robustness while small values are most vulnerable. Guided by these insights, the authors implement a nonlinear pre-transform $f(x)=x^{1/3}$ to enable a non-uniform quantization that concentrates density on small values, achieving near-lossless results at $W4A16$ and $W8A8$. The proposed non-uniform approach reduces quantization error and demonstrates a practical path toward memory-efficient LLM deployment, particularly in memory-bound scenarios such as KV-Cache-heavy inference, while highlighting the need for hardware support for computation efficiency.
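
The pre-transform idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the symmetric per-tensor quantizer, the choice of the maximum absolute value as scale, and the synthetic tensor with a few injected outliers are all assumptions made for illustration. The only element taken from the summary above is the transform $f(x)=x^{1/3}$, applied before uniform quantization and inverted (by cubing) afterwards, which places the reconstruction levels more densely near zero.

```python
import numpy as np

def uniform_quantize(x, bits):
    """Symmetric per-tensor uniform quantization (assumed setup, not from the paper).

    The scale is chosen so that the largest absolute value maps to the top level.
    """
    qmax = 2 ** (bits - 1) - 1
    alpha = np.max(np.abs(x))
    if alpha == 0:
        return x.copy()
    step = alpha / qmax
    return np.clip(np.round(x / step), -qmax, qmax) * step

def cube_root_quantize(x, bits):
    """Quantize uniformly in the f(x) = x^(1/3) domain, then cube to map back."""
    y = np.cbrt(x)                      # sign-preserving cube root
    return uniform_quantize(y, bits) ** 3

# Synthetic tensor: mostly small values plus a few hypothetical outliers.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=4096)
x[:4] = [25.0, -30.0, 40.0, -35.0]

err_uniform = np.mean(np.abs(uniform_quantize(x, 4) - x))
err_cube_root = np.mean(np.abs(cube_root_quantize(x, 4) - x))
print("mean abs error, uniform  :", err_uniform)
print("mean abs error, cube root:", err_cube_root)
```

On this synthetic tensor the outliers stretch the uniform grid so far that most small values collapse to zero, while the cube-root grid keeps several levels in the range where most values lie. The sketch covers only the numerical error, not hardware efficiency.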

Abstract

Quantization has emerged as a promising technique for improving the memory and computational efficiency of large language models (LLMs). Though the trade-off between performance and efficiency is well-known, there is still much to be learned about the relationship between quantization and LLM performance. To shed light on this relationship, we propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs. We call this approach "the lens of perturbation". Using this lens, we conduct experiments with various artificial perturbations to explore their impact on LLM performance. Our findings reveal several connections between the properties of perturbations and LLM performance, providing insights into the failure cases of uniform quantization and suggesting potential solutions to improve the robustness of LLM quantization. To demonstrate the significance of our findings, we implement a simple non-uniform quantization approach based on our insights. Our experiments show that this approach achieves minimal performance degradation on both 4-bit weight quantization and 8-bit quantization for weights and activations. These results validate the correctness of our approach and highlight its potential to improve the efficiency of LLMs without sacrificing performance.
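
The "lens of perturbation" can be made concrete in a few lines: a quantized tensor is the original tensor plus a perturbation $\Delta$, and an artificial perturbation can be substituted for $\Delta$ to study its effect. The sketch below is a hedged illustration, not the paper's experimental code. The 8-bit symmetric quantizer, the synthetic weight matrix, and the use of Gaussian noise as the artificial perturbation are assumptions; the abstract only states that various artificial perturbations are used.

```python
import numpy as np

def quantize_uniform(x, bits=8):
    """Symmetric per-tensor uniform quantization (assumed setup)."""
    qmax = 2 ** (bits - 1) - 1
    step = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / step), -qmax, qmax) * step

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256))   # hypothetical weight matrix

# Native perturbation: the difference between the quantized and original tensor.
delta = quantize_uniform(w) - w

# Artificial perturbation with the same standard deviation as delta.
# Gaussian noise is an assumption chosen for illustration.
noise = rng.normal(0.0, 1.0, size=w.shape)
noise *= delta.std() / noise.std()

w_quantized = w + delta      # identical to quantize_uniform(w)
w_perturbed = w + noise      # same perturbation magnitude, different distribution
```

Feeding `w_quantized` and `w_perturbed` through the same model would then isolate the effect of the perturbation's distribution from that of its magnitude.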

Paper Structure (16 sections, 4 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Performance of quantized LLMs on Lambada (Paperno et al., 2016) across model families (BLOOM (Scao et al., 2022), OPT (Zhang et al., 2022), and LLAMA (Touvron et al., 2023)), with parameter counts ranging from 350M to 66B. We implement two uniform quantization settings: one quantizes both weights and activations to 8 bits (W8A8), and the other applies 4-bit channel-wise quantization to weights only (W4A16).
  • Figure 2: A toy example of 4-bit uniform quantization with different scale factors $\alpha$. The top line shows a choice of $\alpha$ that is too large, so too much information is lost in the quantization process. The middle line shows the most common choice, where $\alpha$ is set to the maximum value of the tensor. The bottom line uses a smaller $\alpha$, which results in less perturbation at the cost of clipping out-of-range values.
  • Figure 3: An illustration of the relationship between the scale factor and the intensity of perturbation. To calculate the perturbation intensity, we use the L2 norm of $\Delta$, which represents the distance between the high-precision tensor and the quantized one.
  • Figure 4: Lambada (Paperno et al., 2016) performance comparison for different types of perturbations. Here, "No perturbation" denotes the vanilla LLM (without quantization and without perturbation), and "Uniform quantization" denotes W8A8 uniform quantization. For a fair comparison, all artificial perturbations are set to have the same variance as the native perturbation caused by uniform quantization. Each perturbation method is run four times with different random seeds.
  • Figure 5: Effects of clipping on the Lambada dataset (left) and on Wikitext-2 (Merity et al., 2016) and C4 (Raffel et al., 2019) (averaged perplexity, right). We compare three values of $k$, which determines the clipping threshold: $k\in\{3,5,10\}$. Even with the most stringent clipping setting, $k=3$, fewer than 0.1% of values are affected by clipping on average.
  • ...and 1 more figure
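
The relationship described in the captions of Figures 2 and 3, between the scale factor $\alpha$, the perturbation intensity $\lVert\Delta\rVert_2$, and clipping, can be explored with a small sketch. It rests on several assumptions: a symmetric 4-bit quantizer, a synthetic standard-normal tensor, and scale factors expressed as multiples of the tensor's maximum absolute value. The function names are hypothetical, and the sketch does not reproduce the $k$-based clipping threshold of Figure 5, whose exact definition is not given in the captions.

```python
import numpy as np

def quantize_with_alpha(x, alpha, bits=4):
    """Symmetric uniform quantization with an explicit scale factor alpha.

    Values whose magnitude exceeds alpha are clipped to the outermost level.
    """
    qmax = 2 ** (bits - 1) - 1
    step = alpha / qmax
    return np.clip(np.round(x / step), -qmax, qmax) * step

def perturbation_stats(x, alpha, bits=4):
    """Return (L2 norm of delta, fraction of values beyond alpha)."""
    delta = quantize_with_alpha(x, alpha, bits) - x
    intensity = np.linalg.norm(delta)
    clipped = np.mean(np.abs(x) > alpha)
    return intensity, clipped

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)          # synthetic tensor (assumption)
x_max = np.max(np.abs(x))

# alpha too large, alpha = max (the common choice), and a smaller alpha.
results = {r: perturbation_stats(x, r * x_max) for r in (2.0, 1.0, 0.5)}
for ratio, (intensity, clipped) in results.items():
    print(f"alpha = {ratio:.1f} * max | L2(delta) = {intensity:.3f} | clipped = {clipped:.4%}")
```

Setting $\alpha$ to twice the maximum doubles the step size and therefore increases the perturbation, while setting it to the maximum clips nothing. A smaller $\alpha$ trades a finer grid against clipping error on the values that fall outside the range.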