DiscQuant: A Quantization Method for Neural Networks Inspired by Discrepancy Theory

Jerry Chee, Arturs Backurs, Rainie Heck, Li Zhang, Janardhan Kulkarni, Thomas Rothvoss, Sivakanth Gopi

TL;DR

DiscQuant reframes post-training weight quantization as a rounding problem from discrepancy theory. Exploiting the approximately low-rank structure of the gradient covariance and a randomized rounding procedure inspired by Lovett and Meka, it rounds all but $O(m)$ weights to the quantization grid, where $m$ is the number of data samples, while preserving performance on the data distribution. The authors give both a theoretical guarantee and a practical algorithm that combines a distillation-based objective with projection steps to stay feasible, followed by a final rounding via RTN. On Phi-3 mini and Meta-Llama-3.1-8B-Instruct, DiscQuant improves over GPTQ and RTN across multiple grids and quantization schemes, and the code is released for replication.
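For reference, the RTN baseline mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the DiscQuant repository: the function name `round_to_nearest`, the 9-level uniform grid, and the sample weights are all hypothetical.

```python
import numpy as np

def round_to_nearest(w, grid):
    """Map every weight to the closest value in a 1-D quantization grid."""
    grid = np.sort(np.asarray(grid, dtype=float))
    w = np.asarray(w, dtype=float)
    # Index of the first grid point >= w, clipped so both neighbours exist.
    hi = np.clip(np.searchsorted(grid, w), 1, len(grid) - 1)
    lo = hi - 1
    # Choose the closer neighbour; exact ties go to the lower grid point.
    pick_hi = (grid[hi] - w) < (w - grid[lo])
    return np.where(pick_hi, grid[hi], grid[lo])

# Hypothetical uniform grid with step 0.25 on [-1, 1].
grid = np.linspace(-1.0, 1.0, 9)
w = np.array([-0.93, -0.10, 0.13, 0.60, 1.40])
w_rtn = round_to_nearest(w, grid)
print(w_rtn)
```

RTN looks at each weight in isolation, which is why it needs no data. Weights outside the grid range are clamped to the nearest end point.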

Abstract

Quantizing the weights of a neural network has two steps: (1) Finding a good low bit-complexity representation for weights (which we call the quantization grid) and (2) Rounding the original weights to values in the quantization grid. In this paper, we study the problem of rounding optimally given any quantization grid. The simplest and most commonly used way to round is Round-to-Nearest (RTN). By rounding in a data-dependent way instead, one can improve the quality of the quantized model significantly. We study the rounding problem from the lens of discrepancy theory, which studies how well we can round a continuous solution to a discrete solution without affecting solution quality too much. We prove that given $m=\mathrm{poly}(1/ε)$ samples from the data distribution, we can round all but $O(m)$ model weights such that the expected approximation error of the quantized model on the true data distribution is $\le ε$ as long as the space of gradients of the original model is approximately low rank (which we empirically validate). Our proof, which is algorithmic, inspired a simple and practical rounding algorithm called DiscQuant. In our experiments, we demonstrate that DiscQuant significantly improves over the prior state-of-the-art rounding method called GPTQ and the baseline RTN over a range of benchmarks on Phi3mini-3.8B and Llama3.1-8B. For example, rounding Phi3mini-3.8B to a fixed quantization grid with 3.25 bits per parameter using DiscQuant gets 64% accuracy on the GSM8k dataset, whereas GPTQ achieves 54% and RTN achieves 31% (the original model achieves 84%). We make our code available at https://github.com/jerry-chee/DiscQuant.
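The low-rank condition on the gradients can be checked numerically by looking at how quickly the eigenvalues of a gradient second-moment matrix decay. The sketch below shows one way to do this on synthetic data. Everything in it is an assumption made for illustration: the gradients are random low-rank vectors plus noise rather than gradients of a real model, the dimensions are arbitrary, the matrix is the uncentered second moment, and a Gaussian Johnson-Lindenstrauss projection is used to keep the matrix small.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, num_samples, true_rank = 2048, 256, 512, 8  # hypothetical sizes

# Synthetic stand-in for per-sample gradients: a rank-8 signal whose
# component scales decay like 1/j, plus a small amount of dense noise.
basis = rng.standard_normal((true_rank, n))
scales = np.arange(1, true_rank + 1) ** -1.0
coeffs = rng.standard_normal((num_samples, true_rank)) * scales
grads = coeffs @ basis + 0.01 * rng.standard_normal((num_samples, n))

# Johnson-Lindenstrauss sketch: Gaussian projection from n down to k dims.
P = rng.standard_normal((n, k)) / np.sqrt(k)
sketched = grads @ P

# Uncentered second-moment matrix of the sketched gradients (k x k),
# with eigenvalues sorted from largest to smallest.
second_moment = sketched.T @ sketched / num_samples
eigvals = np.linalg.eigvalsh(second_moment)[::-1]

# Fraction of the spectrum carried by the leading directions.
top_mass = eigvals[:true_rank].sum() / eigvals.sum()
print(f"top-{true_rank} eigenvalues carry {top_mass:.4f} of the total")
```

A spectrum that concentrates on a few leading eigenvalues is what "approximately low rank" means here. With real model gradients the rows of `grads` would come from backpropagation on calibration samples.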

Paper Structure (25 sections, 6 theorems, 32 equations, 10 figures, 7 tables)

Key Result

theorem 1

If the eigenvalues of the covariance matrix of gradients decay polynomially fast, then given $m=\mathrm{poly}\left (\frac{\log n}{\varepsilon}\right)$ samples $s_1,s_2,\dots,s_m \sim \mathcal{D}_{\textrm{data}}$ there is a randomized algorithm to find $\hat{w}$ with $n-m$ weights rounded such that $

Figures (10)

  • Figure 1: An illustrative figure showing the convex polytope $K$ formed by the intersection of an $n$-dimensional hypercube $H$ and an $n-m$ dimensional affine subspace $V$. Any vertex of $K$ should have $n-m$ coordinates which are fully rounded.
  • Figure 2: Select results quantizing Phi-3-mini-4k-instruct and Meta-Llama-3.1-8B-Instruct using block scaling quantization. GSM8k is a math-based generative task, and WinoGrande and PIQA are multiple choice commonsense reasoning tasks. Error bars are standard errors from lm-evaluation-harness. See the experiments section for full results.
  • Figure 3: First order approximation of the error function $\Delta f$ when quantizing the model to 4.25 bits using RTN and DiscQuant. Here $f$ is the per-token loss function and $s$ is sampled from the WikiText-2 dataset.
  • Figure 4: Eigenvalues of the covariance matrix of the gradients of pre-trained models. The covariance matrix is estimated by averaging over $8k$ sample gradients from RedPajama-1T-Sample and projecting them to $2048$ dimensions using Johnson-Lindenstrauss projections.
  • Figure 5: Quantizing Phi-3-mini-4k-instruct and Meta-Llama-3.1-8B-Instruct with block scaling and additional incoherence processing. DiscQuant can compose with other quantization improvements, and with incoherence processing it remains competitive with GPTQ.
  • ...and 5 more figures
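The geometric fact in Figure 1, that a vertex of $K = H \cap V$ has at least $n-m$ rounded coordinates, can be illustrated with a small walk inside the hypercube. The sketch below is a simplified deterministic stand-in, not the paper's Lovett-Meka-style procedure or the released DiscQuant optimizer. The function name `walk_to_vertex`, the random matrix `G` standing in for the $m$ linear constraints that define $V$, and the sizes are hypothetical. In the quantization setting one would presumably read each coordinate of `x` as interpolating a weight between its two neighbouring grid points, which is an assumption here.

```python
import numpy as np

def walk_to_vertex(x, G, rng, tol=1e-9):
    """Move x inside [0,1]^n while keeping G @ x fixed, until at most
    rank(G) coordinates remain strictly between 0 and 1."""
    x = np.asarray(x, dtype=float).copy()
    free = (x > tol) & (x < 1 - tol)
    x[~free] = np.round(x[~free])
    while free.any():
        idx = np.flatnonzero(free)
        # Directions that move only free coordinates and keep G @ x fixed.
        _, s, vt = np.linalg.svd(G[:, idx], full_matrices=True)
        rank = int(np.sum(s > tol * max(1.0, s[0])))
        null_basis = vt[rank:]
        if null_basis.shape[0] == 0:
            break  # no feasible direction left: x is a vertex of K
        d = null_basis.T @ rng.standard_normal(null_basis.shape[0])
        d /= np.abs(d).max()
        # Largest step before some free coordinate reaches 0 or 1.
        xf = x[idx]
        steps = np.full(idx.size, np.inf)
        up, down = d > tol, d < -tol
        steps[up] = (1.0 - xf[up]) / d[up]
        steps[down] = -xf[down] / d[down]
        xf = np.clip(xf + steps.min() * d, 0.0, 1.0)
        # Freeze every coordinate that has reached the boundary.
        hit = (xf <= tol) | (xf >= 1 - tol)
        xf[hit] = np.round(xf[hit])
        x[idx] = xf
        free[idx[hit]] = False
    return x

rng = np.random.default_rng(0)
n, m = 60, 6                          # hypothetical sizes
G = rng.standard_normal((m, n))       # stand-in for the m constraints
x0 = rng.uniform(0.2, 0.8, size=n)    # fractional starting point
x = walk_to_vertex(x0, G, rng)

num_rounded = int(np.sum((x == 0.0) | (x == 1.0)))
drift = float(np.abs(G @ x - G @ x0).max())
print(num_rounded, "of", n, "coordinates rounded; constraint drift", drift)
```

Each iteration freezes at least one coordinate, and the walk can only get stuck once the free coordinates number at most the rank of `G`, so at least $n-m$ coordinates end at 0 or 1 while `G @ x` stays (numerically) unchanged.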

Theorems & Definitions (10)

  • theorem 1: Informal
  • theorem 4
  • theorem 5: Derived from LovettMekaFOCS12
  • proof
  • theorem 6
  • proposition 7
  • proof: Proof of the main theorem
  • proof: Proof of the Schatten-1 matrix concentration proposition
  • lemma 8
  • proof