SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs
Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen
TL;DR
SignRoundV2 targets the practical deployment of extremely low-bit LLMs by replacing expensive retraining with a two-pronged approach: a DeltaLoss-based sensitivity metric guides adaptive bit allocation, and a lightweight pre-tuning search stabilizes quantization scales. The method uses dynamic programming to assign per-layer bit-widths under a global bit budget and emphasizes activation distortion in the sensitivity metric to reduce memory and compute. Empirical results on LLaMA and Qwen show competitive accuracy at 4–5 bits and robust performance at 2 bits, with substantially lower quantization cost than QAT-based approaches. This work offers a scalable, hardware-friendly PTQ framework capable of near full-precision performance under aggressive quantization, accelerating practical LLM deployment on resource-constrained hardware.
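The bit-allocation step described above can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `allocate_bits`, the per-layer sensitivity tables, the layer sizes, and the candidate bit-widths are all hypothetical and chosen only for illustration. It picks one bit-width per layer to minimize the summed sensitivity while keeping the size-weighted average bit-width within a global budget.

```python
from typing import Dict, List, Sequence, Tuple


def allocate_bits(
    sensitivity: Sequence[Dict[int, float]],
    sizes: Sequence[int],
    avg_bits: float,
) -> List[int]:
    """Choose one bit-width per layer to minimize total sensitivity,
    subject to sum(size_i * bits_i) <= avg_bits * sum(size_i)."""
    budget = avg_bits * sum(sizes)
    # best maps "bits used so far" -> (total loss, bit-widths chosen so far)
    best: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for layer_loss, size in zip(sensitivity, sizes):
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for used, (loss, picks) in best.items():
            for bits, delta in layer_loss.items():
                new_used = used + size * bits
                if new_used > budget:
                    continue  # this choice would exceed the global budget
                cand = (loss + delta, picks + [bits])
                # keep only the cheapest way to reach each usage level
                if new_used not in nxt or cand[0] < nxt[new_used][0]:
                    nxt[new_used] = cand
        best = nxt
    if not best:
        raise ValueError("budget too small for the given bit options")
    return min(best.values(), key=lambda entry: entry[0])[1]


# Hypothetical sensitivities: layer 0 is far more sensitive than the others.
example_sensitivity = [
    {2: 2.0, 4: 0.6, 8: 0.05},
    {2: 0.1, 4: 0.05, 8: 0.01},
    {2: 0.3, 4: 0.2, 8: 0.1},
]
picks = allocate_bits(example_sensitivity, sizes=[1, 1, 1], avg_bits=4.0)
print(picks)
```

With these made-up numbers, the search spends the budget on the sensitive first layer and drops the other two to 2 bits, rather than assigning 4 bits uniformly. Real layer sizes are large, so a practical version would measure the budget in coarser units to keep the state space small.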
Abstract
Extreme low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2 bits and even at 4 bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework that is highly effective even without mixed precision. SignRoundV2 introduces (1) a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and (2) a lightweight pre-tuning search for quantization scales to improve extremely low-bit quantization. These components allow SignRoundV2 to close the gap with full-precision models. Extensive experiments indicate that our method sustains competitive accuracy for LLMs, achieving production-grade performance with about 1% variance at 4–5 bits and strong results even at 2 bits. The implementation is available at https://github.com/intel/auto-round.
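One way to read "combines gradient information with quantization-induced deviations" is as a first-order estimate of the loss change, roughly the sum of |g · (q(x) − x)| over a tensor's elements. The sketch below illustrates that reading only. The exact form of the metric is not given in this section, and the helper names `fake_quantize` and `first_order_sensitivity`, the symmetric round-to-nearest quantizer, and the sample values are all assumptions.

```python
from typing import List, Sequence


def fake_quantize(values: Sequence[float], bits: int) -> List[float]:
    """Symmetric round-to-nearest onto a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def first_order_sensitivity(
    grads: Sequence[float], values: Sequence[float], bits: int
) -> float:
    """Approximate the loss change as sum |g * (q(x) - x)|."""
    quantized = fake_quantize(values, bits)
    return sum(abs(g * (q - x)) for g, x, q in zip(grads, values, quantized))


# Hypothetical tensor values and their loss gradients.
values = [0.9, -0.35, 0.12, 0.5]
grads = [0.2, -1.0, 0.4, 0.1]
s2 = first_order_sensitivity(grads, values, bits=2)
s8 = first_order_sensitivity(grads, values, bits=8)
print(s2, s8)
```

A score of this kind is large only where a big quantization error coincides with a big gradient, which is what makes it usable as a per-layer cost when deciding where to spend extra bits.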
