QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

TL;DR

QLLM tackles the activation outlier bottleneck in low-bitwidth post-training quantization (PTQ) of large language models with two components: adaptive channel reassembly, which disassembles outlier channels into sub-channels and then assembles similar channels, and an efficient low-rank error correction. The method stays cheap to run because reassembly is gradient-free and only a small set of low-rank weights is learned; these weights can be fused into the frozen weights after training. Experiments on LLaMA-1 and LLaMA-2 show strong gains at 4-bit quantization: the 4-bit LLaMA-2-70B is quantized within 10 hours on a single A100-80G GPU, with significant accuracy improvements over state-of-the-art PTQ methods, sometimes surpassing quantization-aware training (QAT) baselines. The work demonstrates the practical viability of deploying low-bit LLMs with reduced training and inference overheads, and outlines avenues for further speedups via kernel fusion and broader channel merging strategies.

Abstract

Large Language Models (LLMs) excel in NLP, but their compute and memory demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. Existing studies identify activation outliers in particular channels as the bottleneck to PTQ accuracy. These studies propose to shift the magnitudes from activations to weights, which either offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved in two steps. Channel disassembly first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Channel assembly then merges similar channels to maintain the original channel number for efficiency. Additionally, an adaptive strategy automatically determines the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that learns only a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.
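
The channel operations and the low-rank fusion described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `y = x @ w` layout (activations of shape `(tokens, in_channels)`, weights of shape `(in_channels, out_channels)`), and the merge rule (average the two activations, sum the two weight rows) are assumptions chosen for illustration. The sketch also omits the quantizer itself, the adaptive search for the number of sub-channels, and any re-quantization after fusion.

```python
import numpy as np


def disassemble_channel(x, w, idx, t):
    """Split input channel `idx` into `t` sub-channels (hypothetical helper).

    Each sub-channel carries x[:, idx] / t and reuses the original weight row,
    so x @ w is unchanged while the peak activation magnitude shrinks by t.
    """
    sub_x = np.repeat(x[:, idx:idx + 1] / t, t, axis=1)   # (tokens, t)
    sub_w = np.repeat(w[idx:idx + 1, :], t, axis=0)       # (t, out_channels)
    x_new = np.concatenate([np.delete(x, idx, axis=1), sub_x], axis=1)
    w_new = np.concatenate([np.delete(w, idx, axis=0), sub_w], axis=0)
    return x_new, w_new


def assemble_channels(x, w, i, j):
    """Merge input channels i and j into one (hypothetical helper).

    Assumed rule: average the activations and sum the weight rows. This is
    exact only when the two activation columns are equal, and approximate
    otherwise, which is why similar channels are the ones worth merging.
    """
    merged_x = (x[:, i:i + 1] + x[:, j:j + 1]) / 2
    merged_w = w[i:i + 1, :] + w[j:j + 1, :]
    keep = [k for k in range(x.shape[1]) if k not in (i, j)]
    x_new = np.concatenate([x[:, keep], merged_x], axis=1)
    w_new = np.concatenate([w[keep, :], merged_w], axis=0)
    return x_new, w_new


def fuse_low_rank(w, a, b):
    """Fold a low-rank update a @ b into the weight matrix (hypothetical helper)."""
    return w + a @ b


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(8, 6))
x[:, 2] += 100.0                      # channel 2 plays the role of an outlier
w = rng.normal(size=(6, 4))
y = x @ w

# Disassembly: the output is preserved, the activation range is narrower.
x_d, w_d = disassemble_channel(x, w, idx=2, t=4)
print("output preserved:", np.allclose(y, x_d @ w_d))
print("max |x| before/after:", np.abs(x).max(), np.abs(x_d).max())

# Assembly of two ordinary channels: fewer channels, small approximation error.
x_a, w_a = assemble_channels(x_d, w_d, i=0, j=1)
print("channels:", x_d.shape[1], "->", x_a.shape[1])
print("relative error:", np.linalg.norm(y - x_a @ w_a) / np.linalg.norm(y))

# Low-rank fusion: one matmul with the fused weight equals the two-branch form.
a, b = rng.normal(size=(6, 2)), rng.normal(size=(2, 4))
print("fusion exact:", np.allclose(x @ fuse_low_rank(w, a, b), y + (x @ a) @ b))
```

Disassembly is lossless in floating point, so its only cost is the extra channels. Assembly trades a small approximation error for restoring the channel count, and fusion removes the extra low-rank matmul at inference time.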

Paper Structure

This paper contains 32 sections, 10 equations, 5 figures, 20 tables, 1 algorithm.

Figures (5)

  • Figure 1: An illustration of the channel-wise maximum and minimum values for the input activations of a linear layer in LLaMA-65B for (a) the original pre-trained model, (b) after SmoothQuant (Xiao et al., 2023), and (c) after our channel reassembly.
  • Figure A: PyTorch-style pseudocode for channel disassembly and assembly at runtime.
  • Figure B: An illustration of the searched expansion ratios using our adaptive strategy for 4-bit LLaMA-1-7B.
  • Figure C: An illustration of the searched expansion ratios using our adaptive strategy for 4-bit LLaMA-1-13B.
  • Figure D: An illustration of the channel-wise maximum and minimum input activation values for the MSA, up projection and down projection layers in FFN of different blocks in LLaMA-1-13B.