TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction

Yuhang Li, Priyadarshini Panda

TL;DR

TesseraQ is a post-training quantization (PTQ) technique that quantizes the weights of LLMs to ultra-low bit widths. It integrates seamlessly with existing scaling- or clipping-based PTQ algorithms and significantly enhances their performance, establishing a new state of the art.

Abstract

Large language models (LLMs) have revolutionized natural language processing, albeit at the cost of immense memory and computation requirements. Post-training quantization (PTQ) is becoming the de facto method to reduce the memory footprint and improve the inference throughput of LLMs. In this work, we aim to push the upper limit of LLM PTQ by optimizing the weight rounding parameters with the block reconstruction technique, a predominant method in prior work on vision models. We propose TesseraQ, a new state-of-the-art PTQ technique, to quantize the weights of LLMs to ultra-low bits. To effectively optimize the rounding in LLMs and stabilize the reconstruction process, we introduce progressive adaptive rounding. This approach iteratively transitions the soft rounding variables to hard variables during the reconstruction process. Additionally, we optimize the dequantization scale parameters to fully leverage the block reconstruction technique. We demonstrate that TesseraQ can be seamlessly integrated with existing scaling- or clipping-based PTQ algorithms such as AWQ and OmniQuant, significantly enhancing their performance and establishing a new state of the art. For instance, when compared to AWQ, TesseraQ improves the wikitext2 perplexity from 14.65 to 6.82 and average downstream accuracy from 50.52 to 59.27 with 2-bit weight-only quantization of LLaMA-2-7B. Across a range of quantization schemes, including W2A16, W3A16, W3A3, and W4A4, TesseraQ consistently exhibits superior performance.
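
To make "weight rounding parameters" concrete, the following minimal sketch shows one common way to express uniform quantization with a per-weight binary rounding variable (round down or round up). It is an illustration under stated assumptions, not the authors' implementation: the function names, the asymmetric zero-point form, and the toy values are all hypothetical, and round-to-nearest is shown only as the special case in which the rounding variable follows the fractional part.

```python
import math

def quantize_weights(weights, scale, zero_point, bits, round_up):
    """Fake-quantize weights; round_up[i] in {0, 1} selects floor or ceil.

    Hypothetical helper: the parameterization is an assumption for illustration.
    """
    qmax = 2 ** bits - 1
    out = []
    for w, h in zip(weights, round_up):
        q = math.floor(w / scale) + h + zero_point  # integer code before clipping
        q = min(max(q, 0), qmax)                    # clip to the b-bit range
        out.append(scale * (q - zero_point))        # dequantize with the scale
    return out

def nearest_rounding(weights, scale):
    """Round-to-nearest written as binary rounding variables."""
    return [1 if (w / scale - math.floor(w / scale)) >= 0.5 else 0
            for w in weights]

# Toy 2-bit example (values are made up for illustration).
weights = [-0.31, -0.05, 0.12, 0.44]
scale, zero_point, bits = 0.25, 2, 2

rtn = nearest_rounding(weights, scale)
w_rtn = quantize_weights(weights, scale, zero_point, bits, rtn)
print(rtn)    # rounding decisions
print(w_rtn)  # dequantized weights on the 4-level grid
```

With only four levels available at 2 bits, each rounding decision moves a weight by a full quantization step, which is why choosing the rounding variables by optimization rather than by nearest rounding can matter at this precision.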

Paper Structure

This paper contains 18 sections, 9 equations, 4 figures, 12 tables, 1 algorithm.

Figures (4)

  • Figure 1: The overall workflow of our proposed method. (a) We apply TesseraQ to optimize the weight rounding parameters once the transformation scale and clipping range have been determined using prior methods such as AWQ/OmniQuant. (b) We propose Progressive Adaptive Rounding (PAR) for block-wise reconstruction, which iteratively hardens some rounding variables and optimizes the remaining soft rounding variables until all variables become binary.
  • Figure 2: Perplexity comparison of TesseraQ with other PTQ methods on the LLaMA-2-7B model quantized to different weight precisions (INT2, INT3). $g$ denotes the group size.
  • Figure 3: Ablation study of the PAR schedule. We experiment with several rule-based $P$ adjustments and one handcrafted adjustment. (AWQ baseline results: average PPL: 16.66, average acc.: 50.52).
  • Figure 4: Reconstruction loss convergence. We compare the block reconstruction loss of OmniQuant and TesseraQ during optimization. Our method significantly reduces the loss in each block.
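
The hardening loop described in the Figure 1(b) caption can be sketched in a few lines. This is a toy illustration, not the paper's algorithm: ranking variables by their distance from 0.5, storing the soft variables directly as values in [0, 1], the schedule values, and the stand-in optimization step are all assumptions introduced here. In the actual method the soft variables would be updated by minimizing a block reconstruction loss.

```python
import math

def harden(soft, hard_mask, fraction):
    """Freeze the most confident soft variables until `fraction` of all are hard."""
    n = len(soft)
    target = math.ceil(fraction * n)
    need = max(0, target - sum(hard_mask))
    # Assumption: confidence is the distance from 0.5 (larger means closer to binary).
    candidates = sorted(
        (i for i in range(n) if not hard_mask[i]),
        key=lambda i: abs(soft[i] - 0.5),
        reverse=True,
    )
    for i in candidates[:need]:
        soft[i] = 1.0 if soft[i] >= 0.5 else 0.0
        hard_mask[i] = True

def progressive_adaptive_rounding(soft, schedule, optimize_step):
    """Alternate hardening and optimization; return binary rounding decisions."""
    soft = list(soft)
    hard_mask = [False] * len(soft)
    for fraction in schedule:
        harden(soft, hard_mask, fraction)
        optimize_step(soft, hard_mask)  # must update only the still-soft entries
    harden(soft, hard_mask, 1.0)        # guarantee every variable ends up binary
    return [int(v) for v in soft]

def make_toy_step(targets, rate=0.5):
    """Stand-in for reconstruction-loss optimization: pull soft values toward targets."""
    def step(soft, hard_mask):
        for i, t in enumerate(targets):
            if not hard_mask[i]:
                soft[i] = min(1.0, max(0.0, soft[i] + rate * (t - soft[i])))
    return step

initial = [0.9, 0.45, 0.2, 0.6]
schedule = [0.25, 0.5, 0.75, 1.0]   # fraction of hard variables after each round
decisions = progressive_adaptive_rounding(
    initial, schedule, make_toy_step([1, 1, 0, 0])
)
print(decisions)
```

In this toy run, the two least confident variables (0.45 and 0.6) stay soft longest and end up on the opposite side from where immediate thresholding of the initial values would have placed them, which is the behavior that gradual hardening is meant to allow.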