CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs
Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian
TL;DR
This work tackles the challenge of deploying large language models under strict memory and compute constraints by proposing CLAQ, a training-free post-training quantization framework that operates at the column level. CLAQ combines K-Means-based per-column codebooks with an Outlier Order sensitivity metric to guide two adaptive strategies: Adaptive Precision (AP) and Outlier Reservation (OR), and can fuse AP+OR for extreme low-bit scenarios. Empirical results on LLaMA-1/2 and Yi-34B across perplexity and zero-shot benchmarks demonstrate state-of-the-art performance at 2–4 bit quantization, with substantial gains particularly at 2-bit, while keeping modest memory overhead. The approach offers a practical path to highly compressed, efficient LLM deployment on resource-constrained devices, and the code is publicly available for reproducibility and further research.
Abstract
Parameter quantization for Large Language Models (LLMs) has recently attracted increasing attention as a way to reduce memory costs and improve computational efficiency. Early approaches have been widely adopted; however, existing methods perform poorly in low-bit (e.g., 2- to 3-bit) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantization (CLAQ) framework that introduces three types of adaptive strategies for LLM quantization. First, we propose a K-Means clustering based algorithm that dynamically generates quantization centroids for each column of a parameter matrix. Second, we design an outlier-guided adaptive precision search strategy that dynamically assigns different bit-widths to different columns. Finally, we develop a dynamic outlier reservation scheme that retains some parameters in their original floating-point precision, trading a small amount of additional memory for improved model performance. Experiments on various mainstream open-source LLMs, including LLaMA-1, LLaMA-2 and Yi, demonstrate that our methods achieve state-of-the-art results across different bit settings, especially in extremely low-bit scenarios. Code is available at https://github.com/fayuge/CLAQ.
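To make the first strategy concrete, the following is a minimal sketch of per-column K-Means quantization, not the released CLAQ implementation. The function names (`kmeans_1d`, `quantize_columns`, `dequantize_columns`), the quantile-based centroid initialisation, and the fixed iteration count are assumptions chosen for illustration. The sketch also leaves out calibration data, adaptive precision, and outlier reservation.

```python
import numpy as np

def kmeans_1d(values, n_clusters, n_iters=20):
    """Plain Lloyd's k-means on a 1-D array; returns (centroids, codes)."""
    # Assumption: initialise centroids at evenly spaced quantiles of the column.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, n_clusters))
    for _ in range(n_iters):
        # Assign every weight to its nearest centroid.
        codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each non-empty centroid to the mean of its members.
        for k in range(n_clusters):
            members = values[codes == k]
            if members.size:
                centroids[k] = members.mean()
    codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, codes

def quantize_columns(weight, bits):
    """Quantize each column of `weight` with its own 2**bits-entry codebook."""
    n_rows, n_cols = weight.shape
    codebooks = np.empty((n_cols, 2 ** bits), dtype=weight.dtype)
    codes = np.empty((n_rows, n_cols), dtype=np.uint8)
    for j in range(n_cols):
        codebooks[j], codes[:, j] = kmeans_1d(weight[:, j], 2 ** bits)
    return codebooks, codes

def dequantize_columns(codebooks, codes):
    """Rebuild the matrix by looking up each code in its column's codebook."""
    col_index = np.arange(codes.shape[1])[None, :]
    return codebooks[col_index, codes]

# Usage on a random matrix standing in for a real weight matrix.
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 8)).astype(np.float32)
codebooks, codes = quantize_columns(weight, bits=2)
reconstructed = dequantize_columns(codebooks, codes)
print("mean squared error:", float(np.mean((weight - reconstructed) ** 2)))
```

Because every column carries its own codebook, the centroids can follow that column's weight distribution instead of a fixed uniform grid. The cost is storing `2**bits` floating-point centroids per column alongside the integer codes.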
