EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs

Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang

TL;DR

EasyQuant is a data-free, weight-only 4-bit quantization method for LLMs that keeps a small fraction of weight outliers in full precision and optimizes the quantization ranges with gradient-based updates. By isolating outliers and applying per-channel range optimization, it achieves near-lossless performance without any calibration data, so the quantized model does not depend on a particular dataset, and it quantizes 176B-scale models within minutes. The approach outperforms several data-dependent baselines on perplexity and zero-shot tasks, and outlier handling adds negligible latency overhead. Two insights drive the method: weight outliers have an outsized impact on model quality, and the reconstruction error admits a gradient with respect to the quantization range. Together they yield a fast, scalable, and generalizable data-free quantization framework. This makes deployment of extremely large models more practical through substantial memory savings, though the method relies on CUDA kernels and does not by itself reduce inference compute.

Abstract

Large language models (LLMs) have proven far superior to conventional methods in various tasks. However, their expensive computations and high memory requirements are prohibitive for deployment. Model quantization is an effective method for reducing this overhead. The problem is that in most previous works, the quantized model was calibrated using a few samples from the training data, which might affect the generalization of the quantized LLMs to unknown cases and tasks. Hence, in this work we explore an important question: can we design a data-independent quantization method for LLMs that guarantees their generalization performance? We propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for LLMs. Our observations indicate that two factors, outliers in the weights and the quantization ranges, are essential for reducing the quantization error. Therefore, in EasyQuant we leave the outliers (less than 1% of the weights) unchanged and optimize the quantization range to reduce the reconstruction error. With these methods, we surprisingly find that EasyQuant achieves performance comparable to the original model. Since EasyQuant does not depend on any training data, the generalization performance of the quantized LLMs is safely guaranteed. Moreover, EasyQuant can be implemented in parallel, so the quantized model can be obtained in a few minutes even for LLMs with over 100B parameters. To the best of our knowledge, this is the first work that achieves almost lossless quantization performance for LLMs under a data-independent setting, and our algorithm runs over 10 times faster than the data-dependent methods.
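
The effect of leaving outliers unquantized can be illustrated with a minimal sketch. It is not the paper's implementation: the 3-sigma outlier criterion, the symmetric round-to-nearest quantizer, and all function names (`quantize_rtn`, `outlier_mask`, `quantize_keep_outliers`, `mse`) are assumptions chosen for illustration, and only the Python standard library is used.

```python
import random

def quantize_rtn(values, bits=4):
    """Symmetric round-to-nearest quantization with one shared scale."""
    if not values:
        return []
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def outlier_mask(values, n_sigma=3.0):
    """Flag values farther than n_sigma standard deviations from the mean.

    The n-sigma rule is an assumed criterion for this sketch.
    """
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [abs(v - mean) > n_sigma * std for v in values]

def quantize_keep_outliers(values, bits=4, n_sigma=3.0):
    """Quantize only the normal values; outliers pass through unchanged."""
    mask = outlier_mask(values, n_sigma)
    normal = [v for v, is_out in zip(values, mask) if not is_out]
    quantized = iter(quantize_rtn(normal, bits))
    restored = [v if is_out else next(quantized) for v, is_out in zip(values, mask)]
    return restored, mask

def mse(a, b):
    """Mean squared reconstruction error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

if __name__ == "__main__":
    random.seed(0)
    # Synthetic "weight channel": small Gaussian values plus two large outliers.
    weights = [random.gauss(0.0, 0.02) for _ in range(1024)] + [0.5, -0.6]
    plain = quantize_rtn(weights)
    kept, mask = quantize_keep_outliers(weights)
    print("outliers kept in full precision:", sum(mask))
    print("MSE, all values quantized:      ", mse(weights, plain))
    print("MSE, outliers left unquantized: ", mse(weights, kept))
```

In this toy setting the two outliers stretch the shared quantization range so far that most normal values collapse onto a handful of levels; removing them from the quantized set lets the same 4 bits cover a much narrower range.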

Paper Structure

This paper contains 30 sections, 11 equations, 2 figures, 10 tables, and 1 algorithm.

Figures (2)

  • Figure 1: Pipeline of EasyQuant. We first find all the outliers in the weight and keep them in full precision (fp32/fp16/bf16). Afterward, we optimize the quantization range (denoted as $q_{range}$) in order to approximate the normal values more precisely. In the end, the normal values are quantized into lower bits (denoted as $Q[\cdot]$) with the optimized quantization ranges, and the outliers are left unchanged in the weight.
  • Figure 2: A smaller reconstruction error does not guarantee better model performance. Straightforwardly shrinking the quantization ranges clips most of the outliers to very small values, so the perplexity increases severely, since those outliers are critical for preserving the model's performance. However, when those outliers are kept unquantized, the quantized model performs better as the reconstruction error decreases continuously. This result clearly suggests that the outliers are more important than the normal values in the weight, and that optimizing the quantization ranges using the gradient defined in Eq. \ref{eq:def_grad} can significantly increase the accuracy of quantized models. More details about the experiment can be found in Section \ref{sec:exp}.
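
The range-optimization step described in the captions above can be sketched as plain gradient descent on the reconstruction error. The paper's gradient (Eq. \ref{eq:def_grad}) is not reproduced on this page, so the sketch makes its own assumptions: the range is parametrized by a single symmetric scale $s$, the loss is the mean of $(s \cdot q_i - w_i)^2$ with $q_i = \mathrm{round}(w_i / s)$ clamped to the 4-bit grid, and the rounding term is treated as constant when differentiating, which gives $\partial L / \partial s = \frac{2}{n} \sum_i (s q_i - w_i)\, q_i$. The names `int_codes`, `recon_error`, and `optimize_scale`, the learning rate, and the step count are all hypothetical.

```python
import random

def int_codes(values, scale, bits=4):
    """Integer codes on the symmetric grid for a given scale."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]

def recon_error(values, scale, bits=4):
    """Mean squared error between the values and their dequantized codes."""
    codes = int_codes(values, scale, bits)
    return sum((scale * q - v) ** 2 for q, v in zip(codes, values)) / len(values)

def optimize_scale(values, bits=4, lr=0.02, steps=200):
    """Gradient descent on the scale; returns the best (scale, error) seen.

    Assumes `values` holds the normal (non-outlier) weights and is not all zero.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax       # start from the min-max range
    best_scale, best_err = scale, recon_error(values, scale, bits)
    for _ in range(steps):
        codes = int_codes(values, scale, bits)
        # Derivative of the loss w.r.t. the scale, with the codes held fixed.
        grad = 2.0 * sum((scale * q - v) * q for q, v in zip(codes, values)) / len(values)
        new_scale = scale - lr * grad
        if new_scale <= 0:                           # keep the scale positive
            break
        scale = new_scale
        err = recon_error(values, scale, bits)
        if err < best_err:                           # track the best iterate
            best_scale, best_err = scale, err
    return best_scale, best_err

if __name__ == "__main__":
    random.seed(1)
    normal_values = [random.gauss(0.0, 0.02) for _ in range(2048)]
    init_scale = max(abs(v) for v in normal_values) / 7
    opt_scale, opt_err = optimize_scale(normal_values)
    print("initial scale / error:  ", init_scale, recon_error(normal_values, init_scale))
    print("optimized scale / error:", opt_scale, opt_err)
```

Tracking the best iterate means the returned scale never has a larger reconstruction error than the min-max starting point, even if individual steps overshoot. As Figure 2 cautions, a lower reconstruction error helps only once the outliers have been excluded from the quantized set.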