Evaluating Quantized Large Language Models

Shiyao Li; Xuefei Ning; Luning Wang; Tengxuan Liu; Xiangsheng Shi; Shengen Yan; Guohao Dai; Huazhong Yang; Yu Wang

TL;DR

This work provides a comprehensive PTQ-based evaluation of quantized LLMs across 11 model families and five task types, examining Weight, Activation, and KV Cache quantization. By combining tensor-distribution statistics with extensive experiments on basic NLP, emergent abilities, trustworthiness, dialogue, and long-context tasks, it derives practical guidance on bit-width choices and on when quantization preserves performance within small margins. Key findings show that Weight and KV Cache quantization are generally more robust than Activation quantization, with $W4$, $W4A8$, and $KV4$ achieving near-lossless performance in many settings, while extremely low bit-widths remain challenging. The study also highlights nuances that depend on model size and task, including emergent abilities and ethics considerations, and offers actionable recommendations for deploying quantized LLMs under memory and latency constraints. The results contribute to practical quantization strategies and identify directions for future improvement, such as mixed-precision schemes and QAT-alternative approaches.

Abstract

Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the requirements of both high efficiency and performance across diverse scenarios, a comprehensive evaluation of quantized LLMs is essential to guide the selection of quantization methods. This paper presents a thorough evaluation of the effect of PTQ on Weight, Activation, and KV Cache across 11 model families, including OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba, with parameter counts ranging from 125M to 180B. The evaluation encompasses five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. We also evaluate state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on extensive experiments, we systematically summarize the effect of quantization, provide recommendations for applying quantization techniques, and point out future directions. The code can be found at https://github.com/thu-nics/qllm-eval.
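To make the memory argument concrete, the following minimal sketch estimates weight storage as parameter count times bit-width, which is the rule the Figure 3 caption uses for parameter memory overhead. The function name `weight_memory_gib` and the 7-billion-parameter model size are illustrative assumptions, and the estimate ignores the extra storage for scales and zero-points.

```python
def weight_memory_gib(num_params: int, bits: int) -> float:
    """Estimate weight storage in GiB as parameter count times bit-width.

    Assumption: quantization metadata (scales, zero-points) is ignored.
    """
    return num_params * bits / 8 / 2**30


# Hypothetical 7-billion-parameter model (illustrative size, not a measured result).
NUM_PARAMS = 7_000_000_000

for label, bits in [("FP16", 16), ("W8", 8), ("W4", 4), ("W3", 3), ("W2", 2)]:
    print(f"{label:>4}: {weight_memory_gib(NUM_PARAMS, bits):6.2f} GiB")
```

Under this rule the estimate scales linearly with bit-width, so W4 needs a quarter of the FP16 weight memory.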

Paper Structure (62 sections, 3 equations, 33 figures, 19 tables)

Introduction
Preliminaries
Quantization
Benchmarks and Models
Statistical Analysis
Evaluation on Basic NLP Tasks
Experimental Setups
Effects of Quantization on Three Tensor Types
Effects of Quantization on Different LLMs
Effects of Quantization on Different Tasks
Evaluation on Emergent Abilities
Experimental Setups
Experimental Results
Evaluation on Trustworthiness
Experimental Setups
...and 47 more sections

Figures (33)

Figure 1: (a) Per-token quantization for Activation, (b) Group-wise quantization for Weight and KV Cache.
Figure 2: The effect of quantization on different tensor types on LAMBADA (Natural Language Understanding task).
Figure 3: Performance of the quantized LLMs with respect to their parameter scales. The parameter memory overheads are estimated by multiplying the parameter size by the quantization bit-width. The markers $\bullet$, $\blacktriangle$, $\blacksquare$, $\blacklozenge$, and $+$ denote the quantization bit-widths W2, W3, W4, W8, and FP16, respectively.
Figure 4: The effect of quantization on four emergent abilities. We normalize the performance of quantized LLMs based on the performance of FP16 LLMs. "ICL", "C-MR", "M-MR", "IF", and "SC" are short for "In-Context Learning", "Commonsense Multi-Step Reasoning", "Mathematical Multi-Step Reasoning", "Instruction-Following", and "Self-Calibration", respectively.
Figure 5: The effect of quantization on the Ethics Benchmark.
...and 28 more figures
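The two granularities named in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes min-max uniform round-to-nearest quantization, and the helper names (`quantize_dequantize`, `per_token`, `group_wise`, `max_abs_error`), the grouping axis, and the toy values are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def quantize_dequantize(values: List[float], bits: int) -> List[float]:
    """Min-max uniform round-to-nearest; all values share one scale and offset."""
    levels = 2**bits - 1
    lo, hi = min(values), max(values)
    if hi == lo:  # constant input needs no quantization grid
        return list(values)
    scale = (hi - lo) / levels
    # Map each value to an integer code in [0, levels], then back to float.
    return [round((v - lo) / scale) * scale + lo for v in values]


def per_token(activation: Matrix, bits: int) -> Matrix:
    """One scale/offset per token, i.e. per row of a [tokens, hidden] matrix."""
    return [quantize_dequantize(row, bits) for row in activation]


def group_wise(tensor: Matrix, bits: int, group_size: int) -> Matrix:
    """One scale/offset per contiguous group of `group_size` values in a row."""
    out = []
    for row in tensor:
        new_row: List[float] = []
        for start in range(0, len(row), group_size):
            new_row.extend(quantize_dequantize(row[start:start + group_size], bits))
        out.append(new_row)
    return out


def max_abs_error(a: Matrix, b: Matrix) -> float:
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy row with one outlier (hypothetical values).
x = [[0.1, -0.2, 0.3, 8.0]]
print("per-row error   :", max_abs_error(x, per_token(x, bits=4)))
print("group-wise error:", max_abs_error(x, group_wise(x, bits=4, group_size=2)))
```

In this toy case the outlier stretches the shared range of the whole row, while grouping confines its effect to one group, which is why finer granularity tends to reduce rounding error at the cost of storing more scales.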
