SingleQuant: Efficient Quantization of Large Language Models in a Single Pass

Jinying Xiao, Bin Ji, Shasha Li, Xiaodong Liu, Ma Jun, Ye Zhong, Wei Li, Xuan Xie, Qingbo Wu, Jie Yu

TL;DR

SingleQuant addresses convergence and efficiency bottlenecks in post-training LLM quantization by decoupling quantization from optimization. It uses two closed-form rotation modules, ART for outlier smoothing and URT for distribution uniformity, implemented via a Kronecker-structured rotation to enable single-pass, gradient-free quantization. Empirical results show dramatic speedups (up to ~1420×) and improved or competitive task performance across 4-bit weight-activation quantization on 7B–70B models, establishing new state-of-the-art results. This approach significantly enhances practical deployment of large language models in resource-limited environments.
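The practical appeal of a Kronecker-structured rotation is that it can be applied without ever forming the full matrix. The sketch below illustrates that general identity only; it is not the authors' implementation. The helper names (`random_orthogonal`, `apply_kron_rotation`), the factor sizes, and the use of random orthogonal factors are all assumptions made for the example.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Hypothetical helper: random n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def apply_kron_rotation(x, r1, r2):
    """Compute (r1 kron r2) @ x without materialising the (n1*n2) x (n1*n2) matrix."""
    n1, n2 = r1.shape[0], r2.shape[0]
    # With row-major reshaping, (r1 kron r2) vec(X) equals vec(r1 @ X @ r2.T).
    return (r1 @ x.reshape(n1, n2) @ r2.T).reshape(-1)

rng = np.random.default_rng(0)
n1, n2 = 8, 16                      # illustrative factor sizes, not taken from the paper
r1 = random_orthogonal(n1, rng)
r2 = random_orthogonal(n2, rng)
x = rng.standard_normal(n1 * n2)    # stand-in for one activation vector

y_fast = apply_kron_rotation(x, r1, r2)
y_dense = np.kron(r1, r2) @ x       # reference: explicit Kronecker product

kron_matches = np.allclose(y_fast, y_dense)
norm_preserved = np.isclose(np.linalg.norm(y_fast), np.linalg.norm(x))
print(kron_matches, norm_preserved)
```

The structured form costs two small matrix products instead of one product with an n1·n2 by n1·n2 matrix, and because both factors are orthogonal the result is still a norm-preserving transformation.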

Abstract

Quantization of Large Language Models (LLMs) facilitates their deployment in resource-limited settings, but existing methods that combine gradient optimization with quantization truncation, two mutually incompatible components, suffer from serious convergence pathology. This prolongs quantization time and degrades the task performance of the quantized LLMs. Our studies confirm that applying the Straight-Through Estimator (STE) on Stiefel manifolds introduces non-smoothness and gradient noise, which obstruct optimization convergence and prevent high-fidelity quantized LLMs from being obtained even after extensive training. To address these limitations, we propose SingleQuant, a single-pass quantization framework that is decoupled from quantization truncation, thereby eliminating these sources of non-smoothness and gradient noise. Specifically, SingleQuant constructs an Alignment Rotation Transformation (ART) and a Uniformity Rotation Transformation (URT) that target distinct types of activation outliers: ART smooths outlier values via closed-form optimal rotations, and URT reshapes distributions through geometric mapping. Both transformations are composed of strictly formulated Givens rotations with predetermined dimensions and rotation angles, which yields strong task performance within a short quantization time. Experimental results demonstrate SingleQuant's superiority over the selected baselines across diverse tasks on 7B-70B LLMs: quantized LLMs achieve higher task performance while requiring less quantization time. For example, when quantizing LLaMA-2-13B, SingleQuant achieves a 1,400× quantization speedup and improves average task performance by 0.57% compared to the best selected baseline.
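To make the idea of smoothing an outlier with a closed-form Givens rotation concrete, here is a minimal toy sketch. It is not the ART construction from the paper: the 4-dimensional vector, the choice of coordinate pair, the equal-magnitude angle rule, and the symmetric 4-bit quantizer (`quantize_sym`) are all assumptions chosen for illustration.

```python
import numpy as np

def givens(n, i, j, theta):
    """n x n Givens rotation acting on coordinates i and j by angle theta."""
    g = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    g[i, i], g[i, j] = c, -s
    g[j, i], g[j, j] = s, c
    return g

def quantize_sym(x, bits=4):
    """Hypothetical symmetric per-vector quantizer: round to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

x = np.array([8.0, 0.3, -0.2, 0.1])  # toy activation; coordinate 0 is the outlier
i, j = 0, 1                           # pair the outlier with an ordinary coordinate

# Closed-form angle: rotate the (x_i, x_j) pair onto the diagonal so that
# both rotated coordinates share the same magnitude.
theta = np.pi / 4 - np.arctan2(x[j], x[i])
g = givens(x.size, i, j, theta)
y = g @ x

# Quantize in the rotated basis, then rotate back (g is orthogonal, so its inverse is g.T).
err_plain = np.linalg.norm(quantize_sym(x) - x)
err_rot = np.linalg.norm(g.T @ quantize_sym(y) - x)
print(np.max(np.abs(x)), np.max(np.abs(y)), err_plain, err_rot)
```

Spreading the outlier's energy over two coordinates lowers the peak magnitude, which shrinks the quantization step size for every coordinate. In this toy case that also lowers the reconstruction error, and no gradient step is involved.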

Paper Structure

This paper contains 11 sections, 2 theorems, 17 equations, 4 figures, 5 tables.

Key Result

Theorem 1

Under a finite step size, the Stiefel gradient norm of Cayley SGD satisfies: (Proof and detailed discussion can be found in Appendix A)
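For readers unfamiliar with Cayley-based optimization on the Stiefel manifold, the following sketch shows a generic Cayley-retraction descent step and checks that the iterate stays orthogonal. It is an assumption-laden illustration: the exact update rule analysed in Theorem 1 may differ, and the toy loss, the step size, and the name `cayley_step` are invented for this example.

```python
import numpy as np

def cayley_step(x, grad, lr):
    """One generic Cayley-retraction descent step for an orthogonal matrix x."""
    # Skew-symmetric direction built from the Euclidean gradient.
    w = grad @ x.T - x @ grad.T
    eye = np.eye(x.shape[0])
    # The Cayley transform of a skew-symmetric matrix is orthogonal,
    # so the product below remains on the manifold for any step size.
    return np.linalg.solve(eye + 0.5 * lr * w, (eye - 0.5 * lr * w) @ x)

def loss(x, target):
    """Toy objective (an assumption): squared Frobenius distance to a fixed matrix."""
    return 0.5 * np.sum((x - target) ** 2)

rng = np.random.default_rng(0)
n = 6
x, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthogonal starting point
target = rng.standard_normal((n, n))

grad = x - target                                  # Euclidean gradient of the toy loss
x_new = cayley_step(x, grad, lr=0.01)

orth_err = np.linalg.norm(x_new.T @ x_new - np.eye(n))
loss_before, loss_after = loss(x, target), loss(x_new, target)
print(orth_err, loss_before, loss_after)
```

With a smooth loss like this one, a small step decreases the objective while keeping the matrix orthogonal. The abstract's argument is that inserting quantization truncation with STE into this kind of loop introduces non-smoothness and gradient noise.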

Figures (4)

  • Figure 1: Sub-figure (a) compares SingleQuant and SpinQuant in quantization speed (LLMs quantized per hour), end-to-end speedup, and performance on various QA tasks. Sub-figure (b) illustrates deterministic outlier smoothing via Givens rotation on 2D data containing MO. Grey circles represent data points, with red ones indicating MO. Blue ellipses depict the quantization space (its size is determined by the bit-width); covering more data points within this fixed space indicates higher quantization space utilization. Sub-figure (c) presents SingleQuant's framework, which comprises two components: ART smooths outlier magnitudes, targeting prominent/scattered outliers, while URT performs secondary smoothing through distribution optimization. The diagram shows how ART and URT operate on distinct outlier types.
  • Figure 2: SpinQuant applied to W4A4 quantization of LLaMA-2-7B with a linearly decaying learning rate. The orange curve uses 10× SpinQuant’s claimed iterations; adjacent green points are spaced 10 iterations apart. The figure shows the optimization loss and the gradient norm. Results for more models are given in Appendix C.
  • Figure 3: Prefill and decoding speedup of LLaMA-2-7B model across different batch sizes. We decode 256 tokens after the prefill on a sequence length of 2048.
  • Figure 4: Performance comparisons of ART on SingleQuant through multiple optimization steps.

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2