The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm

Jiale Chen; Yalda Shabanzadeh; Elvir Crnčević; Torsten Hoefler; Dan Alistarh

TL;DR

The paper reframes GPTQ quantization of LLM weights as a lattice problem and proves that back-to-front GPTQ exactly implements Babai’s nearest-plane algorithm on the Hessian-defined lattice, yielding a principled no-clipping error bound. This geometric view connects GPTQ to the Closest Vector Problem (CVP) and enables importing decades of lattice techniques to quantization, including an LDL/UDU-based analysis of error propagation and ordering strategies. Beyond theory, the authors introduce overflow-tolerant variants (SSQR and HPTQ) and provide optimized CUDA kernels, achieving improved perplexities and end-to-end speedups on large models. The work thereby grounds GPTQ in solid lattice-theoretic guarantees while delivering practical, scalable post-training quantization methods for billion-parameter transformers.

Abstract

Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. While GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale, its inner workings are described as a sequence of ad-hoc algebraic updates that obscure geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to the first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence rests on a sophisticated mathematical argument and has two analytical consequences: first, the GPTQ error propagation step gains an intuitive geometric interpretation; second, GPTQ inherits the error upper bound of Babai's algorithm under the assumption that no weights are clipped. Leveraging this bound, we design post-training quantization methods that avoid clipping and outperform the original GPTQ. In addition, we provide efficient GPU inference kernels for the resulting representation. Taken together, these results place GPTQ on a firm theoretical footing and open the door to importing decades of progress in lattice algorithms into the design of future quantization algorithms for billion-parameter models.
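
The equivalence stated in the abstract can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' implementation. It assumes an integer grid with unit scale and no clipping, takes the Hessian to be H = X X^T for random inputs X, and uses hypothetical helper names (`babai_nearest_plane`, `gptq_back_to_front`). For readability, the GPTQ-style routine recomputes the inverse of the remaining Hessian block at every step, whereas practical implementations typically reuse a Cholesky factorization.

```python
import numpy as np


def babai_nearest_plane(R, w):
    """Babai's nearest plane on the lattice spanned by the columns of R.

    R is upper triangular with H = R.T @ R, so the Gram-Schmidt directions
    of its columns are the coordinate axes and |R[i, i]| are their lengths.
    The target point is R @ w; the result is the integer coefficient vector.
    """
    n = len(w)
    y = R @ w
    q = np.zeros(n)
    for i in range(n - 1, -1, -1):  # last basis vector first
        # Coefficient of the target along the i-th Gram-Schmidt direction,
        # after subtracting the lattice vectors already chosen.
        c = (y[i] - R[i, i + 1:] @ q[i + 1:]) / R[i, i]
        q[i] = np.rint(c)
    return q


def gptq_back_to_front(H, w):
    """Round one weight at a time, last to first, with error propagation.

    After rounding coordinate i, the not-yet-quantized weights 0..i-1 are
    shifted to the optimum of the quadratic error given the fixed values
    (an Optimal-Brain-Surgeon style update on the remaining Hessian block).
    """
    n = len(w)
    w = w.astype(float).copy()
    q = np.zeros(n)
    for i in range(n - 1, -1, -1):
        q[i] = np.rint(w[i])
        Hinv = np.linalg.inv(H[: i + 1, : i + 1])  # naive, for clarity only
        # The right-hand side is evaluated before the in-place update.
        w[: i + 1] -= (w[i] - q[i]) / Hinv[i, i] * Hinv[:, i]
    return q


rng = np.random.default_rng(0)
n, m = 8, 64                       # toy sizes (assumed)
X = rng.standard_normal((n, m))    # stand-in for layer inputs
H = X @ X.T                        # Hessian of the layer-wise squared error
R = np.linalg.cholesky(H).T        # upper triangular, H = R.T @ R
w = rng.uniform(-4.0, 4.0, size=n)

q_babai = babai_nearest_plane(R, w)
q_gptq = gptq_back_to_front(H, w)

# Quantization error (w - q)^T H (w - q) = ||R (w - q)||^2.
err = (w - q_babai) @ H @ (w - q_babai)
# Each residual component is at most |R[i, i]| / 2 in magnitude, which gives
# the no-clipping bound err <= (1/4) * sum_i R[i, i]^2.
bound = 0.25 * np.sum(np.diag(R) ** 2)

print("same solution:", np.array_equal(q_babai, q_gptq))
print("error within bound:", err <= bound)
```

On this toy instance, both routines should select the same integer vector, and the resulting error should respect the bound. The bound is only guaranteed here because the grid is unbounded; once values are clipped to a finite range, the residual along a direction can exceed half the corresponding diagonal entry.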
