CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

TL;DR

This work tackles the challenge of deploying large language models under strict memory and compute constraints by proposing CLAQ, a training-free post-training quantization framework that operates at the column level. CLAQ combines K-Means-based per-column codebooks with an Outlier Order sensitivity metric to guide two adaptive strategies: Adaptive Precision (AP) and Outlier Reservation (OR), and can fuse AP+OR for extreme low-bit scenarios. Empirical results on LLaMA-1/2 and Yi-34B across perplexity and zero-shot benchmarks demonstrate state-of-the-art performance at 2–4 bit quantization, with substantial gains particularly at 2-bit, while keeping modest memory overhead. The approach offers a practical path to highly compressed, efficient LLM deployment on resource-constrained devices, and the code is publicly available for reproducibility and further research.

Abstract

Parameter quantization for Large Language Models (LLMs) has recently attracted increasing attention as a way to reduce memory costs and improve computational efficiency. Early approaches have been widely adopted. However, existing methods perform poorly in low-bit (such as 2- to 3-bit) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantization (CLAQ) framework that introduces three different types of adaptive strategies for LLM quantization. First, we propose a K-Means clustering based algorithm that dynamically generates quantization centroids for each column of a parameter matrix. Second, we design an outlier-guided adaptive precision search strategy that dynamically assigns varying bit-widths to different columns. Finally, we develop a dynamic outlier reservation scheme that retains some parameters in their original floating-point precision, trading some extra storage for improved model performance. Experiments on various mainstream open-source LLMs, including LLaMA-1, LLaMA-2, and Yi, demonstrate that our methods achieve state-of-the-art results across different bit settings, especially in extremely low-bit scenarios. Code is available at https://github.com/fayuge/CLAQ.
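
The column-level K-Means step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`kmeans_1d`, `quantize_columns`, `dequantize`), the quantile-based initialisation, and the fixed iteration count are all assumptions made for illustration. Each column gets its own codebook of 2^bits centroids, and every weight is replaced by the index of its nearest centroid.

```python
import numpy as np

def kmeans_1d(values, n_centroids, n_iter=25):
    """Plain Lloyd's K-Means on a 1-D array; returns (centroids, codes)."""
    # Hypothetical initialisation: evenly spaced quantiles of the column.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, n_centroids))
    for _ in range(n_iter):
        # Assign every value to its nearest centroid.
        codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each non-empty centroid to the mean of its members.
        for k in range(n_centroids):
            members = values[codes == k]
            if members.size:
                centroids[k] = members.mean()
    # Final assignment against the updated centroids.
    codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, codes

def quantize_columns(weight, bits):
    """Build one codebook per column and encode the matrix as centroid indices."""
    n_centroids = 2 ** bits
    codebooks = np.empty((weight.shape[1], n_centroids), dtype=weight.dtype)
    codes = np.empty(weight.shape, dtype=np.uint8)
    for j in range(weight.shape[1]):
        codebooks[j], codes[:, j] = kmeans_1d(weight[:, j], n_centroids)
    return codebooks, codes

def dequantize(codebooks, codes):
    """Look up each index in the codebook of its own column."""
    cols = np.arange(codes.shape[1])
    return codebooks[cols[None, :], codes]

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 8))
codebooks, codes = quantize_columns(w, bits=2)
w_hat = dequantize(codebooks, codes)
mse = float(((w - w_hat) ** 2).mean())
```

In this sketch the stored model consists of the integer `codes` plus one small codebook per column, which is where the per-column memory overhead comes from.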

Paper Structure

This paper contains 29 sections, 6 equations, 5 figures, and 13 tables.

Figures (5)

  • Figure 1: K-Means clustering based quantization. The elements of a weight-matrix column are the input to K-Means clustering, and the quantization centroids are the output of the clustering algorithm. The pre-trained weights are then quantized to the nearest K-Means cluster center.
  • Figure 2: The overall structure of CLAQ: quantized models are obtained from different quantization approaches. The single-precision K-Means based quantization runs without sensitivity calculation. We use an outlier-ratio-based quantization sensitivity metric (a) to guide column-level adaptive outlier reservation (OR, b) and column-level adaptive precision (AP, c). The AP and OR strategies are orthogonal to each other.
  • Figure 3: The sorted outlier ratios of the columns of a self-attention matrix in LLaMA1-7B; most columns contain few outliers.
  • Figure 4: The positions of the columns with higher outlier ratios in the matrix. Dark-coloured columns are the top 10% of columns by outlier concentration.
  • Figure 5: The overall outlier ratio of the 32 layers in LLaMA1-7B.
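
The outlier-ratio metric and the column-level adaptive precision idea referred to in the captions above can be sketched as follows. This section does not state how an outlier is defined, so the rule used here (an entry whose magnitude exceeds `scale` times the mean absolute value of the matrix), the default `scale=3.0`, the 4-bit/2-bit split, and the function names are all assumptions for illustration only.

```python
import numpy as np

def column_outlier_ratio(weight, scale=3.0):
    """Fraction of entries in each column flagged as outliers.

    Assumption: an entry is an outlier when |w| > scale * mean(|W|).
    Both the rule and scale=3.0 are illustrative, not taken from the paper.
    """
    threshold = scale * np.abs(weight).mean()
    return (np.abs(weight) > threshold).mean(axis=0)

def assign_bits(ratios, high_bits=4, low_bits=2, top_fraction=0.1):
    """Give high_bits to the top_fraction of columns ranked by outlier ratio."""
    n_high = int(round(top_fraction * ratios.size))
    order = np.argsort(-ratios, kind="stable")  # most outlier-heavy first
    bits = np.full(ratios.size, low_bits, dtype=np.int64)
    bits[order[:n_high]] = high_bits
    return bits

# Toy matrix: small weights everywhere, large values planted in columns 3 and 7.
rng = np.random.default_rng(0)
w = rng.uniform(-0.01, 0.01, size=(100, 20))
w[:10, 3] = 1.0
w[:5, 7] = 1.0

ratios = column_outlier_ratio(w)
bits = assign_bits(ratios)
average_bits = float(bits.mean())
```

With these hypothetical settings, 10% of the columns receive 4 bits and the rest 2 bits, so the average width is 0.1 × 4 + 0.9 × 2 = 2.2 bits per weight. The same ranking could also decide how many full-precision outliers to keep in each column, which is the role the captions assign to outlier reservation.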