CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression

Wenyuan Liu, Xindian Ma, Peng Zhang, Yan Wang

TL;DR

This work attributes the accuracy loss from activation quantization in post-training quantization (PTQ) of large language models primarily to the quantization kernel: the set of activation elements that are quantized to zero. It introduces CrossQuant, a cross-quantization method that scales elements using row-wise and column-wise absolute maxima weighted by an exponent $\alpha$. This shrinks the kernel to $\approx 16\%$ for OPT and $<0.1\%$ for LLaMA and achieves near-FP16 accuracy on language modeling, zero-shot, and few-shot tasks without retraining. Key contributions include formalizing the kernel concept, establishing kernel-proportion thresholds below which INT8 activation quantization loss is negligible (about 19% for OPT, about 1% for LLaMA), and demonstrating that CrossQuant matches or outperforms strong baselines on models from 6.7B to 70B parameters. The approach offers a simple, training-free route to high-precision activation quantization, enabling more efficient deployment of large-scale transformers.

Abstract

Post-Training Quantization (PTQ) is an effective technique for compressing Large Language Models (LLMs). While many studies focus on quantizing both weights and activations, maintaining the accuracy of LLMs after activation quantization remains a challenge. To investigate the primary cause, we extend the concept of a kernel from linear algebra to quantization functions and define a new term, "quantization kernel", which refers to the set of elements in activations that are quantized to zero. Through quantitative analysis of the quantization kernel, we find that these elements are crucial for maintaining the accuracy of quantized LLMs: as the quantization kernel shrinks, the precision of quantized LLMs increases. If the quantization kernel proportion is kept below 19% for OPT models and below 1% for LLaMA models, the precision loss from quantizing activations to INT8 becomes negligible. Motivated by the goal of developing a quantization method with a small quantization kernel, we propose CrossQuant: a simple yet effective method for quantizing activations. CrossQuant cross-quantizes elements using row-wise and column-wise absolute maximum vectors, achieving a quantization kernel of approximately 16% for OPT models and less than 0.1% for LLaMA models. Experimental results on LLMs (LLaMA, OPT) ranging from 6.7B to 70B parameters demonstrate that CrossQuant improves or maintains perplexity and accuracy in language modeling, zero-shot, and few-shot tasks.
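To make the kernel idea concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It compares per-token (row-wise) absolute-maximum quantization with a cross-quantization rule on a toy activation matrix containing one outlier channel. The scale form $t_i^{\alpha} c_j^{1-\alpha} / (2^{N-1}-1)$, with $t_i$ the row maximum and $c_j$ the column maximum, is an assumption chosen to match the description "row and column maxima scaled by an exponent $\alpha$". The value `alpha=0.15`, the toy matrix, and all function names are likewise illustrative assumptions.

```python
def absmax(values):
    """Largest absolute value in an iterable of numbers."""
    return max(abs(v) for v in values)


def per_token_scales(X, bits=8):
    """One scale per row (token): row absmax divided by the integer maximum."""
    qmax = 2 ** (bits - 1) - 1
    return [[absmax(row) / qmax] * len(row) for row in X]


def cross_scales(X, alpha=0.15, bits=8):
    """Assumed cross-quantization scale: t_i**alpha * c_j**(1 - alpha) / qmax.

    t_i is the absmax of row i and c_j the absmax of column j. This exact
    form and the default alpha are assumptions made for illustration.
    """
    qmax = 2 ** (bits - 1) - 1
    row_max = [absmax(row) for row in X]
    col_max = [absmax(col) for col in zip(*X)]
    return [[(t ** alpha) * (c ** (1 - alpha)) / qmax for c in col_max]
            for t in row_max]


def quantize(X, scales):
    """Round each element to the nearest integer multiple of its scale."""
    return [[round(x / s) if s > 0 else 0 for x, s in zip(x_row, s_row)]
            for x_row, s_row in zip(X, scales)]


def kernel_proportion(Q):
    """Fraction of quantized elements equal to zero (the quantization kernel)."""
    flat = [q for row in Q for q in row]
    return sum(1 for q in flat if q == 0) / len(flat)


# Toy activations: rows are tokens, columns are channels; column 2 is an outlier channel.
X = [
    [0.20, -0.10, 60.0, 0.30],
    [0.10, 0.15, -50.0, -0.20],
    [-0.05, 0.10, 40.0, 0.10],
]

q_token = quantize(X, per_token_scales(X))
q_cross = quantize(X, cross_scales(X))

k_token = kernel_proportion(q_token)
k_cross = kernel_proportion(q_cross)
print(f"per-token kernel proportion: {k_token:.3f}")
print(f"cross-quant kernel proportion: {k_cross:.3f}")
```

In this toy case the outlier channel inflates every row maximum, so per-token scaling rounds most small elements to zero, while the column term in the assumed cross scale keeps the scale small for non-outlier channels. Because $|x_{ij}| \le \min(t_i, c_j)$, the quantized values under the assumed scale never exceed the integer maximum, so the sketch needs no clipping.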

Paper Structure

This paper contains 18 sections, 6 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: To examine the impact of the quantization kernel on quantization loss, we evaluate the average accuracy of various quantization methods for OPT family models across several zero-shot tasks: Lambada, ARC-easy, Hellaswag, PIQA, and BoolQ. FP16 serves as the baseline, W4 denotes weights quantized to INT4, and A8 denotes activations quantized to INT8. "Remove Kernel" refers to directly setting the elements in the quantization kernel to zero without quantizing the other elements in the activations.
  • Figure 2: A comparison table of Per-token quantization and CrossQuant.
  • Figure 3: An example illustrating the quantization kernels of the two methods on a sample activation matrix ${\bm{X}}$, where "acc" is the average accuracy of OPT-66B on five zero-shot tasks: Lambada, ARC-easy, Hellaswag, PIQA, and BoolQ.
  • Figure 4: The average kernel proportion of each quantization method, computed over all activations in OPT (left) and LLaMA (right) models on WikiText2.
  • Figure 5: We evaluate the OPT and LLaMA models on the language modeling task using the WikiText2 dataset, measuring performance via perplexity. The top two groups are tested with W8A8, while the bottom two groups use W4A8-g128.
  • ...and 4 more figures

Theorems & Definitions (1)

  • Definition 1