The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm

Jiale Chen, Yalda Shabanzadeh, Elvir Crnčević, Torsten Hoefler, Dan Alistarh

TL;DR

The paper reframes GPTQ quantization of LLM weights as a lattice problem and proves that back-to-front GPTQ exactly implements Babai’s nearest-plane algorithm on the Hessian-defined lattice, yielding a principled no-clipping error bound. This geometric view connects GPTQ to the Closest Vector Problem (CVP) and enables importing decades of lattice techniques to quantization, including an LDL/UDU-based analysis of error propagation and ordering strategies. Beyond theory, the authors introduce overflow-tolerant variants (SSQR and HPTQ) and provide optimized CUDA kernels, achieving improved perplexities and end-to-end speedups on large models. The work thereby grounds GPTQ in solid lattice-theoretic guarantees while delivering practical, scalable post-training quantization methods for billion-parameter transformers.

Abstract

Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. While GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale, its inner workings are described as a sequence of ad-hoc algebraic updates that obscure geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: first, the GPTQ error propagation step gains an intuitive geometric interpretation; second, GPTQ inherits the error upper bound of Babai's algorithm under the assumption that no weights are clipped. Leveraging this bound, we design post-training quantization methods that avoid clipping, and outperform the original GPTQ. In addition, we provide efficient GPU inference kernels for the resulting representation. Taken together, these results place GPTQ on a firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models.
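
The Babai side of the equivalence described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and does not run GPTQ itself. It assumes a single weight vector `w`, a uniform grid `scale * integers` with no clipping, and the upper-triangular Cholesky factor of the Hessian as the lattice factor. The names `babai_quantize`, `scale`, and the random test data are hypothetical, introduced only for illustration.

```python
import numpy as np

def babai_quantize(H, w, scale):
    """Nearest-plane rounding of w onto the grid scale * Z^d under the metric H.

    Assumes H is symmetric positive definite and that integers are unbounded
    (no clipping).
    """
    d = w.shape[0]
    U = np.linalg.cholesky(H).T      # upper-triangular factor, H = U.T @ U
    B = U * scale                    # lattice basis: column j is scale[j] * U[:, j]
    y = U @ w                        # target vector in the same coordinates
    z = np.zeros(d)
    for j in range(d - 1, -1, -1):   # back to front: last dimension first
        # Remove the contribution of the already-fixed later coordinates,
        # then round to the nearest plane along dimension j.
        r = y[j] - B[j, j + 1:] @ z[j + 1:]
        z[j] = np.round(r / B[j, j])
    return z.astype(int), B, y

# Hypothetical calibration inputs and weights for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))
H = X.T @ X
w = rng.standard_normal(8)
scale = np.full(8, 0.25)

z, B, y = babai_quantize(H, w, scale)
w_q = scale * z

babai_err = float(np.sum((y - B @ z) ** 2))        # lattice distance squared
layer_err = float(np.sum((X @ (w - w_q)) ** 2))    # layer output error squared
bound = 0.25 * float(np.sum(np.diag(B) ** 2))      # nearest-plane guarantee
rtn_err = float(np.sum((X @ (w - scale * np.round(w / scale))) ** 2))

print(f"Babai: {babai_err:.4f}  bound: {bound:.4f}  RTN: {rtn_err:.4f}")
```

Because the basis is triangular here, each coordinate of the residual is at most half the corresponding diagonal entry in magnitude, which gives the bound checked above. The lattice distance and the layer output error coincide because both equal the quadratic form of the Hessian applied to `w - w_q`. The RTN error is printed for comparison only; the sketch makes no claim about which is smaller on a given instance.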

Paper Structure

This paper contains 30 sections, 7 theorems, 63 equations, 12 figures, 11 tables, 12 algorithms.

Key Result

Theorem 1

The CVPs using any possible factors $\bm{\mathcal{X}}$ of the Hessian matrix $\bm{X}^\top \bm{X}$ are equivalent under an orthogonal transformation (rotation and sign changes) of the lattice and external target vector.
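
A short worked check of why the choice of factor does not matter, restricted to a special case and not taken from the paper's proof: assume two square, invertible factors $\bm{\mathcal{X}}_1$ and $\bm{\mathcal{X}}_2$ with $\bm{\mathcal{X}}_i^\top \bm{\mathcal{X}}_i = \bm{X}^\top \bm{X}$, a weight vector $\bm{w}$, and a candidate quantized vector $\bm{q}$ (the symbols $\bm{w}$, $\bm{q}$, and $\bm{Q}$ are notation assumed here for illustration). The distance being minimized depends only on the Hessian:

$$
\|\bm{\mathcal{X}}_i(\bm{w}-\bm{q})\|_2^2
= (\bm{w}-\bm{q})^\top \bm{\mathcal{X}}_i^\top \bm{\mathcal{X}}_i (\bm{w}-\bm{q})
= (\bm{w}-\bm{q})^\top \bm{X}^\top \bm{X} (\bm{w}-\bm{q}).
$$

The two factors differ by an orthogonal matrix:

$$
\bm{Q} = \bm{\mathcal{X}}_2 \bm{\mathcal{X}}_1^{-1},
\qquad
\bm{Q}^\top \bm{Q}
= \bm{\mathcal{X}}_1^{-\top} \bm{\mathcal{X}}_2^\top \bm{\mathcal{X}}_2 \bm{\mathcal{X}}_1^{-1}
= \bm{\mathcal{X}}_1^{-\top} \bm{X}^\top \bm{X} \bm{\mathcal{X}}_1^{-1}
= \bm{I}.
$$

Since $\bm{\mathcal{X}}_2 = \bm{Q}\bm{\mathcal{X}}_1$, both the lattice basis and the target vector $\bm{\mathcal{X}}_2 \bm{w} = \bm{Q}\bm{\mathcal{X}}_1 \bm{w}$ are transformed by the same $\bm{Q}$, so all distances between the target and lattice points are preserved in this case.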

Figures (12)

  • Figure 1: Upper row: (a) CVP in a two-dimensional lattice; (b) Basis reduction can find a shorter, more orthogonal basis that can potentially improve the results; (c-d) The projection steps in Babai's nearest plane algorithm. Lower row: rounding boundaries of (e) optimal rounding or Voronoi cells; (f) round-to-nearest (RTN); (g) Babai's nearest plane algorithm without basis reduction; (h) Babai's algorithm without basis reduction under the reversely ordered basis.
  • Figure 2: Equivalence of OBQ's error propagation and Babai's projection. (a) 3D plot showing the target being projected onto the nearest plane. (b) 3D plot showing how the projection error is propagated. (c) 2D plot showing the vectors on the nearest hyperplane in (a-b). (d) 2D plot showing the vectors on the orthogonal projection plane in (b).
  • Figure 3: Geometric interpretation of OBQ's quantization order. This 2D plot shows the target being projected onto the nearest plane.
  • Figure 4: (a) Comparison of quantization methods (RTN, GPTQ, HRTN, HPTQ, and SSQR with 1% to 5% outliers) on Qwen3-8B evaluated on WikiText-2. Perplexity is plotted against the average effective bitwidth per weight, with the BF16 baseline shown as a horizontal line. HPTQ has the best (lowest) perplexity. See the appendix for zero-shot evaluation results. (b) Scaling behavior of HPTQ across multiple model sizes (0.6B, 1.7B, 4B, 8B, 14B) and bitwidths (4.125, 3.125, 2.125). The x-axis denotes the effective model size after quantization, and the y-axis shows perplexity on WikiText-2. Each curve corresponds to a fixed bitwidth, while points along a curve represent different model scales. Using our HPTQ method, 3.125-bit stands out as the Pareto-optimal bitwidth (best trade-off between perplexity and compression). (c) End-to-end inference speedups of our SSQR kernel vs the PyTorch BF16 matrix multiplication kernel on an NVIDIA RTX A6000 GPU. We run the Qwen3-8B model across multiple outlier rates (0% to 5%) and inlier bitwidths (4, 3, 2) and measure the TPOT (time per output token) metric. Our kernel achieves about 2$\times$ speedup end-to-end.
  • Figure 5: Layer-wise inference speedup of the SSQR kernel over the PyTorch BF16 baseline on Qwen3-8B across inlier bitwidths, outlier rates, and batch sizes on an A6000 GPU.
  • ...and 7 more figures

Theorems & Definitions (7)

  • Theorem 1: Quantization and CVP
  • Theorem 2: Error Propagation and Babai's projection
  • Corollary 3: OBQ Dimension Selection
  • Theorem 4: GPTQ and Babai
  • Theorem 5: GPTQ Error Bound
  • Theorem 6: Babai's Quantization Order
  • Lemma 7