GPTQT: Quantize Large Language Models Twice to Push the Efficiency

Yipin Guo; Yilin Lang; Qinyuan Ren

TL;DR

GPTQT addresses the memory and compute bottleneck of large language models with a post-training quantization method that converts weights to low-bit binary coding in two stages: linear quantization to a relatively high bit width, followed by conversion of the resulting integer weights to a lower-bit binary coding. The scaling factor is re-explored to compensate for the change in representation range, and at inference time the two stages are fused into a pure binary-coding path, enabling efficient GPU execution via methods like LUT-GEMM. Empirical results on OPT, Llama2, and Bloom show improved perplexity over strong 3-bit baselines (e.g., a $4.01$ perplexity reduction on OPT-66B) and notable speedups (up to $1.24\times$ over GPTQ on OPT-30B). The approach maintains competitive accuracy at low bit widths where traditional PTQ struggles, though activations remain fp16, which can limit throughput in high-concurrency settings.

Abstract

Due to their large size, generative Large Language Models (LLMs) require significant computing and storage resources. This paper introduces a new post-training quantization method, GPTQT, which reduces memory usage and increases processing speed by expressing LLM weights in 3 or 2 bits. In practice, directly minimizing the quantization error of the weights is ineffective and leads to overfitting. GPTQT therefore employs a progressive two-step approach: it first quantizes the weights to a relatively high bit width using linear quantization, then converts the resulting integer weights to a lower-bit binary coding. A re-explore strategy is proposed to optimize the initial scaling factor. During inference, these steps are merged into pure binary coding, enabling efficient computation. Testing across various models and datasets confirms GPTQT's effectiveness. Compared to a strong 3-bit quantization baseline, GPTQT further reduces perplexity by 4.01 on OPT-66B and increases speed by 1.24 times on OPT-30B. The results on Llama2 show that GPTQT is currently the best binary coding quantization method for this kind of LLM.
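
The two-step idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain min-max linear quantization for the first step and a simple greedy residual binary coding for the second, and it omits GPTQ-style error compensation and the scale-factor re-exploration. All function names (`linear_quantize`, `greedy_binary_coding`, `quantize_twice`, `dequantize`) are hypothetical.

```python
import numpy as np

def linear_quantize(w, bits):
    """Step 1 (assumed min-max scheme): map floats to unsigned ints in [0, 2^bits - 1]."""
    levels = 2 ** bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    q = np.clip(np.round((w - w_min) / scale), 0, levels).astype(np.int64)
    return q, scale, w_min

def greedy_binary_coding(x, bits):
    """Step 2 (assumed greedy scheme): x ~ offset + sum_i alpha_i * b_i, b_i in {-1, +1}."""
    x = x.astype(np.float64)
    offset = x.mean()
    residual = x - offset
    alphas, codes = [], []
    for _ in range(bits):
        b = np.where(residual >= 0, 1.0, -1.0)   # sign of the current residual
        alpha = np.abs(residual).mean()          # least-squares optimal scale for this b
        alphas.append(alpha)
        codes.append(b)
        residual = residual - alpha * b
    return np.array(alphas), np.stack(codes), offset

def quantize_twice(w, high_bits=4, low_bits=2):
    """Linear quantization to high_bits, then binary coding of the ints to low_bits."""
    q, scale, zero = linear_quantize(w, high_bits)
    alphas, codes, offset = greedy_binary_coding(q, low_bits)
    # Fuse both steps into a single binary-coding form:
    # w ~ zero + scale * (offset + sum_i alpha_i * b_i)
    return scale * alphas, codes, zero + scale * offset

def dequantize(alphas, codes, offset):
    """Reconstruct weights from the fused binary-coding parameters."""
    return offset + (alphas[:, None] * codes).sum(axis=0)

rng = np.random.default_rng(0)
w = rng.normal(size=256)
alphas, codes, offset = quantize_twice(w)
w_hat = dequantize(alphas, codes, offset)
mse = float(np.mean((w - w_hat) ** 2))
```

Because the linear step is an affine map, its scale and zero point fold into the binary-coding coefficients, so only `alphas`, `codes`, and `offset` are needed at inference time in this sketch.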

Paper Structure (17 sections, 11 equations, 4 figures, 6 tables)

Introduction
GPTQT: quantize LLM twice
Background
Quantize Weight Twice
Re-explore Scale Factor
Intermediate Steps Can be Fused in Inference
Experiment
Setup
Result on OPT
Result on Llama2 and Bloom
Result on PTB dataset
Speed up of GPTQT
Ablation Study
Quantization Overfitting
Intermediate Bit
...and 2 more sections

Figures (4)

Figure 1: GPTQT: Quantize Weight Twice. Initially, the fp16 weight model is quantized to a relatively high bit number (3 bits shown) using linear quantization. Subsequently, the resulting int-type weight is further reduced to fewer bits (2 bits depicted) using binary coding.
Figure 2: Re-exploring scale factor.
Figure 3: Binary coding is a unique variant of linear quantization, structured in a tree-like form. GPTQT selects specific nodes and cotyledons from the linear quantization tree to create a new binary coding tree, thus bypassing intermediate steps during inference. Dark colors indicate final results while light colors denote intermediate results.
Figure 4: The impact of Intermediate Bit on results.
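
The statement in the Figure 3 caption that binary coding and linear quantization are closely related can be checked numerically. The sketch below relies on a standard identity rather than on anything specific to the paper: an n-bit integer q = Σ 2^i c_i with c_i in {0, 1} can be rewritten with signs b_i = 2c_i - 1 as q = Σ 2^(i-1) b_i + (2^n - 1)/2, so a linear grid with scale s is a binary coding with coefficients s·2^(i-1). The helper names and the values of `bits`, `scale`, and `zero` are illustrative assumptions.

```python
import itertools
import numpy as np

def linear_levels(bits, scale, zero):
    """All values representable by linear quantization: zero + scale * q."""
    return np.array([zero + scale * q for q in range(2 ** bits)])

def binary_coding_levels(alphas, offset):
    """All values offset + sum_i alpha_i * b_i over every sign pattern b in {-1, +1}^n."""
    levels = [offset + sum(a * b for a, b in zip(alphas, signs))
              for signs in itertools.product((-1, 1), repeat=len(alphas))]
    return np.sort(np.array(levels))

bits, scale, zero = 3, 0.5, -1.0                       # illustrative values
alphas = [scale * 2 ** (i - 1) for i in range(bits)]   # s/2, s, 2s
offset = zero + scale * (2 ** bits - 1) / 2            # centre of the linear grid

lin = linear_levels(bits, scale, zero)
bc = binary_coding_levels(alphas, offset)
same_grid = bool(np.allclose(lin, bc))
```

With power-of-two coefficients the two grids coincide; general binary coding drops that constraint on the coefficients, which is why it can represent non-uniform grids.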
