SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen

TL;DR

SignRoundV2 targets the practical deployment of extremely low-bit LLMs by replacing expensive retraining with a two-pronged approach: a DeltaLoss-based sensitivity metric guides adaptive bit allocation, and a lightweight pre-tuning search stabilizes quantization scales. The method uses dynamic programming to assign per-layer bit-widths under a global bit budget and emphasizes activation distortion in the sensitivity metric to reduce memory and compute. Empirical results on LLaMA and Qwen show competitive accuracy at 4–5 bits and robust performance at 2 bits, with substantially lower quantization cost than QAT-based approaches. This work offers a scalable, hardware-friendly PTQ framework capable of near full-precision performance under aggressive quantization, accelerating practical LLM deployment on resource-constrained hardware.
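The bit-allocation step described above can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `allocate_bits`, the cost table layout, and the integer layer sizes are all hypothetical, and the per-layer costs stand in for whatever sensitivity scores the method produces. It treats the problem as a multiple-choice knapsack: pick one bit-width per layer so that the summed sensitivity is minimal while the size-weighted bit count stays within the budget.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def allocate_bits(
    costs: Sequence[Dict[int, float]],  # costs[l][b]: assumed sensitivity of layer l at b bits
    sizes: Sequence[int],               # relative layer sizes in integer units (assumption)
    avg_bits: float,                    # target average bits per weight
) -> Tuple[float, List[int]]:
    """Return (total cost, per-layer bit-widths) minimizing cost under the budget."""
    budget = int(avg_bits * sum(sizes) + 1e-9)  # global budget in bit-units
    inf = float("inf")
    # best[u]: minimum cost over the layers processed so far using exactly u bit-units
    best = [inf] * (budget + 1)
    best[0] = 0.0
    choices: List[List[Optional[int]]] = []  # choices[l][u]: bit-width picked for layer l

    for cost, size in zip(costs, sizes):
        new = [inf] * (budget + 1)
        pick: List[Optional[int]] = [None] * (budget + 1)
        for used, acc in enumerate(best):
            if acc == inf:
                continue
            for bits, c in cost.items():
                nxt = used + bits * size
                if nxt <= budget and acc + c < new[nxt]:
                    new[nxt] = acc + c
                    pick[nxt] = bits
        best = new
        choices.append(pick)

    end = min(range(budget + 1), key=lambda u: best[u])
    if best[end] == inf:
        raise ValueError("bit budget is too small for any assignment")

    # Walk backwards through the stored choices to recover the plan.
    plan: List[int] = []
    u = end
    for pick, size in zip(reversed(choices), reversed(sizes)):
        bits = pick[u]
        plan.append(bits)
        u -= bits * size
    plan.reverse()
    return best[end], plan


# Toy example: three layers, candidate widths 2 and 4, average budget of 3 bits.
toy_costs = [{2: 5.0, 4: 1.0}, {2: 1.0, 4: 0.5}, {2: 3.0, 4: 0.2}]
toy_sizes = [1, 1, 2]
total_cost, bit_plan = allocate_bits(toy_costs, toy_sizes, avg_bits=3.0)
print(bit_plan, total_cost)
```

In the toy example the two small layers receive 4 bits and the large layer stays at 2 bits, because upgrading the large layer would consume the whole remaining budget for a smaller reduction in total cost.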

Abstract

Extreme low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2 bits and even at 4 bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework that is highly effective even without mixed precision. SignRoundV2 introduces (1) a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and (2) a lightweight pre-tuning search for quantization scales to improve extremely low-bit quantization. These components allow SignRoundV2 to close the gap with full-precision models. Extensive experiments indicate that our method sustains competitive accuracy for LLMs, achieving production-grade performance with about 1 percent variance at 4-5 bits and strong results even at 2 bits. The implementation is available at https://github.com/intel/auto-round.
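To make the idea of a sensitivity metric that "combines gradient information with quantization-induced deviations" concrete, here is a minimal sketch under stated assumptions. It is not the paper's DeltaLoss formula: it applies a generic first-order Taylor estimate to weights only, uses plain symmetric round-to-nearest quantization, and sums absolute per-element terms so that errors of opposite sign do not cancel. The names `quantize_rtn` and `first_order_sensitivity` are hypothetical.

```python
from typing import List, Sequence


def quantize_rtn(weights: Sequence[float], bits: int) -> List[float]:
    """Symmetric round-to-nearest quantization followed by dequantization."""
    if bits < 2:
        raise ValueError("this sketch assumes at least 2 bits")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / qmax
    # Clamp to the signed integer range [-qmax - 1, qmax] before rescaling.
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]


def first_order_sensitivity(
    weights: Sequence[float], grads: Sequence[float], bits: int
) -> float:
    """Sum of |g_i * (q(w_i) - w_i)|: a first-order proxy for the loss change."""
    dequantized = quantize_rtn(weights, bits)
    return sum(abs(g * (q - w)) for g, q, w in zip(grads, dequantized, weights))


# Toy layer: the gradients are placeholders, not values from a real model.
layer_weights = [0.5, -1.0, 0.25, 0.75]
layer_grads = [1.0, 1.0, 1.0, 1.0]
score_2bit = first_order_sensitivity(layer_weights, layer_grads, bits=2)
score_4bit = first_order_sensitivity(layer_weights, layer_grads, bits=4)
print(score_2bit, score_4bit)
```

Scores of this kind, computed per layer and per candidate bit-width, are the sort of input a bit-allocation procedure could consume: a layer whose score rises sharply at low precision is a candidate for more bits.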

Paper Structure

This paper contains 23 sections, 13 equations, 2 figures, 14 tables.

Figures (2)

  • Figure 1: Average accuracy of pure 2-bit (W2A16) models on Llama 2/3 70B. See detailed results in Table \ref{tab:int_mixed_accuracy}.
  • Figure 2: Layer-wise DeltaLoss sensitivity of Llama-3.1-8B-Instruct under W2A16 and MXFP4.