Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis

Jiaqi Zhao, Ming Wang, Miao Zhang, Yuzhang Shang, Xuebo Liu, Yaowei Wang, Min Zhang, Liqiang Nie

TL;DR

This work introduces PTQ-Bench, a unified benchmarking framework for post-training quantization in LLMs. It provides a comprehensive taxonomy (compensation-based, rotation-based, salience-based, optimization-based), a unified evaluation across bitwidths, architectures, and modalities, and detailed comparative analyses. Key findings show that salience-based methods excel at higher bitwidths while compensation- and rotation-based strategies offer robust performance at very low bitwidths; the largest models at $2$-bit can underperform smaller models at $4$-bit, highlighting a model-size/bitwidth tradeoff. The study argues for a practical fusion of compensation-based methods with other PTQ strategies to achieve state-of-the-art robustness and offers actionable guidance for deploying PTQ in diverse LLM settings.

Abstract

Post-training quantization (PTQ) has been extensively adopted for compressing large language models (LLMs) owing to its efficiency and low resource requirements. However, current research lacks an in-depth analysis of the strengths and applicable scenarios of each PTQ strategy. In addition, existing algorithms focus primarily on performance, overlooking the trade-off among model size, performance, and quantization bitwidth. To address these gaps, we provide a novel benchmark for LLM PTQ in this paper. First, to support our benchmark, we propose a comprehensive taxonomy of existing mainstream methods by scrutinizing their computational strategies (e.g., optimization-based, compensation-based, etc.). Then, we conduct extensive experiments with the baseline within each class, covering models of various sizes (7B-70B), bitwidths, training levels (LLaMA1/2/3/3.1), architectures (Mixtral, DeepSeekMoE and Mamba) and modalities (LLaVA1.5 and VILA1.5) on a wide range of evaluation metrics. Through comparative analysis of the results, we summarize the strengths of each PTQ strategy and the model-size/bitwidth trade-off with respect to performance. For example, our benchmark reveals that the compensation-based technique demonstrates outstanding cross-architecture robustness, and that extremely low-bit PTQ for ultra-large models should be reexamined. Finally, we claim that a practical combination of compensation-based and other PTQ strategies can achieve state-of-the-art robustness across these settings. We believe that our benchmark will provide valuable recommendations for the deployment of LLMs and future research on PTQ approaches. We maintain a repository for our benchmark at https://github.com/zjq0455/PTQ_Benchmark.
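
As a rough illustration of what "quantization bitwidth" means here, the sketch below applies plain round-to-nearest uniform quantization to a toy list of weights at 2 and 4 bits. This is a generic textbook baseline chosen for illustration only, not an implementation of any method evaluated in the benchmark; the function names `quantize_rtn` and `mean_squared_error` and the toy weights are assumptions of this example.

```python
def quantize_rtn(weights, bits):
    """Asymmetric uniform round-to-nearest quantization, then dequantization.

    Illustrative only: real PTQ methods typically work per channel or per
    group and add compensation, rotation, salience, or optimization steps.
    """
    levels = 2 ** bits - 1                      # number of steps between min and max
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0     # guard against a constant tensor
    return [round((w - w_min) / scale) * scale + w_min for w in weights]


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy "weights" spread evenly over [-0.5, 0.5] (hypothetical data).
weights = [i / 100 - 0.5 for i in range(101)]

err_2bit = mean_squared_error(weights, quantize_rtn(weights, 2))
err_4bit = mean_squared_error(weights, quantize_rtn(weights, 4))

print(f"2-bit reconstruction MSE: {err_2bit:.6f}")
print(f"4-bit reconstruction MSE: {err_4bit:.6f}")
```

At 2 bits each weight is mapped to one of only four values, so the reconstruction error on this toy data is noticeably larger than at 4 bits. That gap is the kind of degradation the PTQ strategies in the taxonomy aim to reduce.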

Paper Structure

This paper contains 46 sections, 5 equations, 3 figures, 14 tables.

Figures (3)

  • Figure 1: An overview of our paper. To provide guidelines for future research, we first establish a comprehensive taxonomy of existing milestone PTQ methods. We then establish a novel benchmark, PTQ-Bench, for evaluating several critical characteristics of foundational PTQ strategies. Building on it, we provide an extensive and unified evaluation of the categorized PTQ strategies across a broad range of model sizes, structures, modalities and bitwidths. Finally, we present an in-depth comparative analysis based on the experimental results and offer recommendations for the advancement of LLM PTQ research.
  • Figure 2: Performance varies with model size and quantization bitwidth on LLMs. Regardless of the PTQ strategy used, the performance of a 2-bit large model is always inferior to that of a 4-bit smaller model, exemplified by 2-bit LLaMA-65B and 4-bit LLaMA-7B. In addition, 3-bit PTQ can still showcase the performance benefits associated with larger model sizes.
  • Figure 3: The number of quantization papers since 2022.
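
To put the model-size/bitwidth trade-off from Figure 2 in concrete terms, the following back-of-the-envelope sketch estimates weight-only storage for the two configurations named there. It assumes nominal parameter counts of 65e9 and 7e9 and ignores quantization metadata (scales, zero points), activations, and the KV cache, so the numbers are rough lower bounds rather than measured footprints; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate weight-only storage in GB (10^9 bytes).

    Assumes every parameter is stored at exactly `bits` bits, with no
    per-group scales or zero points (an idealization).
    """
    return num_params * bits / 8 / 1e9


# Nominal parameter counts (assumed, not exact model sizes).
large_2bit = weight_memory_gb(65e9, 2)   # 2-bit LLaMA-65B
small_4bit = weight_memory_gb(7e9, 4)    # 4-bit LLaMA-7B

print(f"2-bit 65B weights: {large_2bit:.2f} GB")
print(f"4-bit 7B weights:  {small_4bit:.2f} GB")
print(f"ratio: {large_2bit / small_4bit:.1f}x")
```

Under these assumptions the 2-bit 65B model still needs roughly 4.6 times the weight storage of the 4-bit 7B model, yet Figure 2 reports that it performs worse. This is why the paper argues that extremely low-bit PTQ for very large models should be reexamined.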