QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa
TL;DR
QuIP# advances weight-only post-training quantization (PTQ) for large language models by combining three components: a Randomized Hadamard Transform for principled incoherence processing, lattice-based vector quantization with the E8P codebook, and inter-layer fine-tuning. Together these yield better compression at 2–4 bits per weight and enable fast inference, with 3-bit models often outperforming 4-bit baselines and 2-bit models approaching or matching higher-bit performance. The approach is validated on Llama-1/2 models across a range of sizes, showing strong perplexity and zero-shot results, and demonstrates practical speedups on modern GPUs. The work also provides a scalable path to higher bitrates via residual vector quantization (RVQ) and clarifies the trade-offs between decoding cost, codebook size, and quantization quality. Overall, QuIP# pushes the boundaries of PTQ, suggesting that 2-bit LLM quantization may soon outperform some higher-bit configurations in both quality and speed.
Abstract
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes ($\le$ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric $E_8$ lattice, which achieves the optimal 8-dimensional unit ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.
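To make the first technique concrete, the following is a minimal NumPy sketch of incoherence processing with a randomized Hadamard transform. It is an illustration only, not the implementation in the linked repository: the function names (`hadamard`, `randomized_hadamard`, `incoherence`), the dense Sylvester construction, and the restriction to power-of-two dimensions are all assumptions made for brevity. A practical implementation would use a fast Walsh-Hadamard transform rather than dense matrix products. The incoherence measure used here, $\mu = \max_{ij} |W_{ij}| \sqrt{mn} / \|W\|_F$, follows the usual definition for an $m \times n$ weight matrix.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix via the Sylvester construction.

    Assumption for this sketch: n is a power of two.
    """
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def randomized_hadamard(W: np.ndarray, rng: np.random.Generator):
    """Apply a two-sided randomized Hadamard transform to W.

    Returns the transformed weights together with the orthogonal matrices
    U and V, so the transform can be undone as U.T @ W_t @ V.
    """
    m, n = W.shape
    s_u = rng.choice([-1.0, 1.0], size=m)  # random sign vector for rows
    s_v = rng.choice([-1.0, 1.0], size=n)  # random sign vector for columns
    U = hadamard(m) * s_u  # equals H_m @ diag(s_u), still orthogonal
    V = hadamard(n) * s_v  # equals H_n @ diag(s_v), still orthogonal
    return U @ W @ V.T, U, V


def incoherence(W: np.ndarray) -> float:
    """mu such that max|W_ij| = mu * ||W||_F / sqrt(m * n)."""
    m, n = W.shape
    return float(np.abs(W).max() * np.sqrt(m * n) / np.linalg.norm(W))


# Hypothetical example: a Gaussian weight matrix with one large outlier.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
W[3, 5] = 50.0

W_t, U, V = randomized_hadamard(W, rng)
mu_before = incoherence(W)
mu_after = incoherence(W_t)
W_back = U.T @ W_t @ V  # exact inverse, since U and V are orthogonal
```

In this sketch the transform preserves the Frobenius norm and is exactly invertible, while the outlier's energy is spread across all entries, so `mu_after` comes out much smaller than `mu_before`. Entries with no dominant outliers are what make the subsequent vector quantization step effective.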
