QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa

TL;DR

QuIP# advances weight-only post-training quantization (PTQ) for large language models by combining three components: a randomized Hadamard transform for principled incoherence processing, lattice-based vector quantization with the E8P codebook, and inter-layer fine-tuning. Together they yield superior compression at 2–4 bits per weight and enable fast inference, with 3-bit models often outperforming 4-bit baselines and 2-bit models approaching or matching higher-bit performance. The approach is validated on Llama 1 and Llama 2 models across a range of sizes, showing strong perplexity and zero-shot results as well as practical speedups on modern GPUs. The work also provides a scalable path to higher bitrates via residual vector quantization (RVQ) and clarifies the trade-offs between decoding cost, codebook size, and quantization quality. Overall, QuIP# pushes the boundaries of PTQ, suggesting that 2-bit LLM quantization may soon outperform some higher-bit configurations in both quality and speed.

Abstract

Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes ($\le$ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric $E_8$ lattice, which achieves the optimal 8-dimensional unit ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.
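
To make the $E_8$ claim concrete, here is a minimal sketch of nearest-point rounding to the plain $E_8$ lattice, written as the union of $D_8$ (integer vectors with even coordinate sum) and $D_8$ shifted by $\tfrac{1}{2}$ in every coordinate. It is not the paper's E8P codebook or its decoding kernel; the function names, the Gaussian test data, and its standard deviation of 3 are illustrative assumptions. Because $E_8$ and the integer grid $\mathbb{Z}^8$ both have one lattice point per unit volume, comparing their elementwise MSE on the same samples is a like-for-like comparison.

```python
import math
import random

def sq_dist(x, p):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, p))

def nearest_z8(x):
    """Scalar baseline: round every coordinate to the nearest integer."""
    return [math.floor(v + 0.5) for v in x]

def nearest_d8(x):
    """Nearest point of D8 (integer vectors whose coordinates sum to an even number)."""
    rounded = [math.floor(v + 0.5) for v in x]
    if sum(rounded) % 2 != 0:
        # Parity is wrong: take the coordinate with the largest rounding
        # error and round it the other way, which is the cheapest fix.
        k = max(range(len(x)), key=lambda i: abs(x[i] - rounded[i]))
        rounded[k] += 1 if x[k] > rounded[k] else -1
    return rounded

def nearest_e8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2): try both cosets, keep the closer."""
    a = nearest_d8(x)
    b = [v + 0.5 for v in nearest_d8([v - 0.5 for v in x])]
    return min((a, b), key=lambda p: sq_dist(x, p))

def per_dim_mse(quantize, samples):
    """Mean squared quantization error per coordinate over all samples."""
    total = sum(sq_dist(x, quantize(x)) for x in samples)
    return total / (len(samples) * len(samples[0]))

# Illustrative data: 8-dimensional Gaussian vectors (sigma chosen arbitrarily).
rng = random.Random(0)
samples = [[rng.gauss(0.0, 3.0) for _ in range(8)] for _ in range(2000)]
print("integer grid MSE:", per_dim_mse(nearest_z8, samples))
print("E8 lattice MSE:  ", per_dim_mse(nearest_e8, samples))
```

On such data the $E_8$ rounding should report a lower per-coordinate error than independent scalar rounding, which is the effect of packing density that the abstract appeals to.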

Paper Structure

This paper contains 44 sections, 9 theorems, 50 equations, 5 figures, 10 tables, 5 algorithms.

Key Result

Lemma 3.0

Let $H$ be any positive semidefinite matrix on $\mathbb{R}^{n \times n}$ and $W$ any weight matrix on $\mathbb{R}^{m \times n}$. Let $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ be orthogonal scaled Hadamard matrices. Let $S_U \in \mathbb{R}^{m \times m}$ and $S_V$ …
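
The setup in this lemma can be sketched in a few lines of pure Python. The sketch assumes, as is usual for a randomized Hadamard transform, that $S_U$ and $S_V$ are diagonal matrices of random $\pm 1$ signs and that the weight matrix is transformed as $(U S_U)\, W\, (V S_V)^T$; both points, along with every function name below, are assumptions for illustration and not taken from the statement above. It also assumes $m$ and $n$ are powers of two so that a fast Walsh-Hadamard transform applies.

```python
import math
import random

def fwht(vec):
    """Orthonormal (1/sqrt(n)-scaled) Walsh-Hadamard transform of vec.

    The length must be a power of two. The scaling makes the transform
    orthogonal, so Euclidean norms are preserved.
    """
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def random_signs(n, rng):
    """Diagonal of a random sign matrix, stored as a list of +1.0 / -1.0."""
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def rht_rows(matrix, signs):
    """Map every row x to Hadamard(signs * x)."""
    return [fwht([s * v for s, v in zip(signs, row)]) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def incoherence_process(W, s_u, s_v):
    """Compute (U S_U) W (V S_V)^T for scaled Hadamard U, V and sign vectors s_u, s_v."""
    right = rht_rows(W, s_v)                 # rows: x -> V (S_V x)
    left = rht_rows(transpose(right), s_u)   # columns: c -> U (S_U c)
    return transpose(left)

# Illustrative 4 x 8 weight matrix with one large outlier entry.
rng = random.Random(0)
W = [[rng.gauss(0.0, 0.1) for _ in range(8)] for _ in range(4)]
W[0][0] = 5.0
W_tilde = incoherence_process(W, random_signs(4, rng), random_signs(8, rng))

frob = lambda M: math.sqrt(sum(v * v for row in M for v in row))
max_abs = lambda M: max(abs(v) for row in M for v in row)
print("Frobenius norm before/after:", frob(W), frob(W_tilde))
print("largest entry before/after: ", max_abs(W), max_abs(W_tilde))
```

Because both factors are orthogonal, the Frobenius norm is unchanged while the outlier's energy is spread across all entries, which is the intuition behind incoherence processing.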

Figures (5)

  • Figure 1: QuIP$\#$ offers unprecedented quantization quality at extreme compression ratios. QuIP$\#$ 3-bit models also scale better than theoretically lossless 4-bit models, a previously unseen result.
  • Figure 2: QuIP$\#$ performs incoherence processing with a Randomized Hadamard Transform and uses lattice codebooks to achieve state-of-the-art quantized models.
  • Figure 3: Minimum achievable elementwise MSE of quantizing a Gaussian to various codebooks. $E_8$-based codebooks outperform other presented codebooks due to the underlying packing density and high dimensionality of $E_8$.
  • Figure 4: QuIP$\#$ scaling, Llama 1. As with Llama 2, QuIP$\#$ 3-bit scales better than QuIP$\#$ 4-bit for Llama 1 models, and QuIP$\#$ 2-bit scales similarly to higher bitrates.
  • Figure 5: QuIP$\#$ scaling. (Top Left) Llama 2 Wikitext 2 perplexity vs AQLM. Context length 4096. QuIP$\#$ 2-bit and 3-bit scale better than AQLM 2-bit and 3-bit. (Top Right) Llama 2 C4 perplexity. Context length 4096. (Bottom) Llama 1 C4 perplexity. Context length 2048.

Theorems & Definitions (17)

  • Definition 2.1 (Chee et al., 2023)
  • Lemma 3.0
  • Theorem 4.1
  • Lemma 1.1
  • proof
  • Lemma 1.2
  • proof
  • Lemma 1.3
  • proof
  • Lemma 1.4
  • ...and 7 more