ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
Yesheng Liang, Haisheng Chen, Song Han, Zhijian Liu
TL;DR
ParoQuant is a weight-only PTQ method for reasoning LLMs that combines independent Givens rotations with channel-wise scaling to suppress weight outliers and narrow the dynamic range within each quantization group. The approach is hardware-aware: a fused CUDA kernel and a layer-wise optimization strategy keep the added decoding latency small. It achieves state-of-the-art accuracy among linear quantization methods across multiple models and reasoning tasks, with an average 2.4% accuracy improvement over AWQ at less than 10% overhead, and is competitive with QTIP at lower overhead. This enables high-fidelity quantization for long chain-of-thought reasoning, with practical benefits for on-device and efficient inference.
Abstract
Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.
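To make the transform described above concrete, the following NumPy sketch applies channel-wise scaling and independent Givens rotations on disjoint channel pairs to a weight matrix, applies the inverse transform to the activation, and then quantizes the transformed weights group by group. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`givens_rotate`, `fake_quantize`), the random pairing, the random angles, the scaling heuristic, and the 4-bit, group-size-8 setting are all hypothetical placeholders, whereas in ParoQuant the rotations and scales are optimized and executed by a fused CUDA kernel.

```python
import numpy as np

def givens_rotate(M, pairs, thetas):
    """Rotate disjoint column pairs (i, j) of M by angles thetas.

    Equivalent to M @ R, where R is an orthogonal matrix made of
    independent 2x2 Givens rotations, one per pair.
    """
    out = M.copy()
    i, j = pairs[:, 0], pairs[:, 1]
    c, s = np.cos(thetas), np.sin(thetas)
    mi, mj = M[:, i], M[:, j]          # fancy indexing returns copies
    out[:, i] = c * mi - s * mj
    out[:, j] = s * mi + c * mj
    return out

def fake_quantize(W, bits=4, group_size=8):
    """Asymmetric round-to-nearest quantization per group of input channels.

    Returns the dequantized weights so the error can be inspected directly.
    """
    rows, cols = W.shape
    G = W.reshape(rows, cols // group_size, group_size)
    lo = G.min(axis=-1, keepdims=True)
    hi = G.max(axis=-1, keepdims=True)
    step = np.maximum(hi - lo, 1e-12) / (2**bits - 1)
    q = np.clip(np.round((G - lo) / step), 0, 2**bits - 1)
    return (q * step + lo).reshape(rows, cols)

rng = np.random.default_rng(0)
out_dim, in_dim = 16, 32
W = rng.normal(size=(out_dim, in_dim))
W[:, 3] *= 20.0                        # inject an outlier input channel
x = rng.normal(size=in_dim)

# Hypothetical, un-optimized parameters (ParoQuant learns these instead).
scale = 1.0 / np.sqrt(np.abs(W).max(axis=0))        # channel-wise scaling
pairs = rng.permutation(in_dim).reshape(-1, 2)      # disjoint channel pairs
thetas = rng.uniform(-np.pi, np.pi, size=len(pairs))

# Weights: W' = W diag(scale) R.  Activation: x' = R^T diag(1/scale) x.
W_t = givens_rotate(W * scale, pairs, thetas)
x_t = givens_rotate((x / scale)[None, :], pairs, thetas)[0]

y_ref = W @ x            # original full-precision output
y_t = W_t @ x_t          # identical up to floating-point error
W_q = fake_quantize(W_t) # quantize the transformed weights
y_q = W_q @ x_t          # output with quantized weights
```

Because the rotation is orthogonal and the scaling is undone on the activation side, `y_t` matches `y_ref`; only the quantization step introduces error. Each Givens rotation touches just two channels, which is what makes the runtime transform of the activation cheap compared with a dense rotation.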
