ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Yesheng Liang, Haisheng Chen, Song Han, Zhijian Liu

TL;DR

ParoQuant is a weight-only PTQ method for reasoning LLMs that combines independent Givens rotations with channel-wise scaling to suppress weight outliers and narrow the dynamic range within each quantization group. The approach is hardware-aware: a co-designed inference kernel exploits GPU parallelism so that the rotations and scaling stay lightweight at runtime. On reasoning tasks it improves accuracy over AWQ by 2.4% on average with less than 10% overhead, achieves the best accuracy among the linear quantization methods compared, and is competitive with QTIP at lower overhead. This enables high-fidelity quantization for long chains of thought, where quantization errors would otherwise accumulate, and supports more efficient deployment of reasoning LLMs.

Abstract

Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.
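
The transform described in the abstract can be illustrated with a minimal sketch. It assumes that channel-wise scaling is applied before the rotations and that the rotated channel pairs are disjoint; the function names, angles, and scales are hypothetical and hand-picked, whereas the paper optimizes them. This is plain Python for illustration, not the paper's GPU kernel.

```python
import math

def apply_pairwise_rotation(row, pairs, angles, scales):
    """Scale each channel, then apply one Givens rotation per disjoint pair (i, j).

    Assumption: scaling comes first; the order in the paper may differ.
    """
    out = [v * s for v, s in zip(row, scales)]
    for (i, j), theta in zip(pairs, angles):
        c, s = math.cos(theta), math.sin(theta)
        xi, xj = out[i], out[j]
        out[i] = c * xi - s * xj
        out[j] = s * xi + c * xj
    return out

def invert_pairwise_rotation(row, pairs, angles, scales):
    """Undo the transform: rotate each pair by -theta, then divide by the scales."""
    out = list(row)
    for (i, j), theta in zip(pairs, angles):
        c, s = math.cos(theta), math.sin(theta)
        xi, xj = out[i], out[j]
        out[i] = c * xi + s * xj
        out[j] = -s * xi + c * xj
    return [v / s for v, s in zip(out, scales)]

# Channel 0 holds an outlier; rotating it against channel 1 spreads its magnitude.
row = [8.0, 0.0, 1.0, 1.0]
rotated = apply_pairwise_rotation(row, [(0, 1)], [math.pi / 4], [1.0, 1.0, 1.0, 1.0])
print(max(abs(v) for v in row), max(abs(v) for v in rotated))
```

Because each rotation is orthogonal and each scale is nonzero, the transform is exactly invertible, so it changes only how the values are distributed across channels before quantization, not the function the layer computes.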

Paper Structure

This paper contains 40 sections, 9 equations, 4 figures, 7 tables, 2 algorithms.

Figures (4)

  • Figure 1: Effect of optimized channel-wise scaling and rotations. Left: Magnitude of the k_proj weight in the first layer of LLaMA-3-8B (Grattafiori et al., 2024) before and after the transform. The outlier channels have been eliminated effectively. Right: Scatter of two channels of the weight matrix before and after the transform. In addition to scaling, which concentrates values of the entire channel, rotations draw values from different channels closer at each token (clustering around the $x=y$ line).
  • Figure 2: Loss curves from optimizing transforms to minimize quantization-induced output error ($\Vert \mathbf{X} Q(\mathbf{W}) - \mathbf{X}\mathbf{W} \Vert$) for the k_proj weight matrix in the first layer of LLaMA-3-8B. Rotations reduce quantization error more than channel-wise scaling does, and keeping only the 10% most significant pairs is as expressive as a full rotation. See Section \ref{sec:effectiveness-analysis} for more details.
  • Figure 3: Speedup of scaled pairwise rotation over the Hadamard transform on an RTX A6000.
  • Figure 4: Perplexity ($\downarrow$) results of 4-bit models. The context length is 8192 for LLaMA-3 and Qwen3 (base models), and 4096 for LLaMA-2. The best results among linear quantization methods are in bold. Speedup over FP16 models is reported as the geometric mean across Q3-1.7, Q3-4, L3-8, and Q3-14, measured on an RTX A6000 with a batch size of 1 during decoding.
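
The output-error objective named in the Figure 2 caption, $\Vert \mathbf{X} Q(\mathbf{W}) - \mathbf{X}\mathbf{W} \Vert$, can be sketched as follows. The quantizer here is an assumed asymmetric round-to-nearest scheme applied to contiguous groups of input channels; the weight is stored transposed (one row per output channel) for convenience. All names and the grouping layout are hypothetical and not taken from the paper.

```python
import math

def quantize_group(values, bits=4):
    """Asymmetric round-to-nearest quantization of one group, returned dequantized."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]

def quantize_rows(Wt, group_size, bits=4):
    """Quantize every row of Wt in contiguous groups of `group_size` input channels."""
    return [
        [q for g in range(0, len(row), group_size)
         for q in quantize_group(row[g:g + group_size], bits)]
        for row in Wt
    ]

def matmul_t(X, Wt):
    """Compute X @ W where Wt is W transposed (rows are output channels)."""
    return [[sum(x * w for x, w in zip(xr, wr)) for wr in Wt] for xr in X]

def output_error(X, Wt, group_size, bits=4):
    """Frobenius norm of X Q(W) - X W."""
    ref = matmul_t(X, Wt)
    got = matmul_t(X, quantize_rows(Wt, group_size, bits))
    return math.sqrt(sum((a - b) ** 2
                         for ra, rb in zip(ref, got)
                         for a, b in zip(ra, rb)))

# One output channel whose group contains an outlier (4.0) next to small values.
X = [[1.0, 1.0, 1.0, 1.0]]
Wt = [[0.1, -0.2, 0.3, 4.0]]
print(output_error(X, Wt, group_size=4, bits=4))
```

In this toy setup the outlier sets the step size for the whole group, so the small values are rounded coarsely; this is the per-group dynamic-range effect that the scaling and rotations are meant to reduce.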

Theorems & Definitions (2)

  • Definition 1: Independent Pairs
  • Definition 2: Independent Rotation
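
This listing gives only the names of the two definitions. One natural reading, stated here as an assumption rather than as the paper's wording, is that a set of channel pairs is independent when no channel appears in more than one pair, so the corresponding Givens rotations touch disjoint coordinates and can be applied in any order or in parallel. A hypothetical check under that reading:

```python
def are_independent(pairs):
    """Return True if no channel index is shared between pairs (assumed definition)."""
    seen = set()
    for i, j in pairs:
        if i == j or i in seen or j in seen:
            return False
        seen.update((i, j))
    return True

print(are_independent([(0, 1), (2, 3)]))  # disjoint channels
print(are_independent([(0, 1), (1, 2)]))  # channel 1 is reused
```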