LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid

Tianyi Zhang; Anshumali Shrivastava

TL;DR

LeanQuant addresses the high memory and compute cost of quantizing large language models by learning loss-error-aware grids that preserve precision for weights whose inverse Hessian diagonals are outliers. It combines non-uniform and affine grid learning with an efficient fused kernel, stays compatible with practical inference kernels, and scales to models as large as 405B parameters on modest hardware, while maintaining or improving zero-shot accuracy and perplexity relative to established baselines. The approach generalizes across common quantization formats and includes an exact OBQ-compatible variant. By enabling accurate post-training quantization on standard GPUs, it lowers the hardware requirements and deployment barriers for open-source LLMs.

Abstract

Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising technique to reduce memory requirements and decoding latency. However, recent accurate quantization methods often depend on specialized computations or custom data formats to achieve better model quality, which limits their compatibility with popular frameworks, as they require dedicated inference kernels tailored to specific hardware and software platforms, hindering wider adoption. Furthermore, many competitive methods have high resource requirements and computational overhead for quantizing models, making it challenging to scale them to hundreds of billions of parameters. In response to these challenges, we propose LeanQuant (Loss-Error-Aware Network Quantization), a novel quantization method that is accurate, versatile, and scalable. In the existing popular iterative loss-error-based quantization framework, we identify a critical limitation in prior methods: the min-max affine quantization grid fails to preserve model quality due to outliers in inverse Hessian diagonals. To overcome this fundamental issue, we propose learning loss-error-aware grids, instead of using non-adaptive min-max affine grids. Our approach not only produces quantized models that are more accurate but also generalizes to a wider range of quantization types, including affine and non-uniform quantization, enhancing compatibility with more frameworks. Extensive experiments with recent LLMs demonstrate that LeanQuant is highly accurate, comparing favorably against competitive baselines in model quality, and scalable, achieving very accurate quantization of Llama-3.1 405B, one of the largest open-source LLMs to date, using two Quadro RTX 8000-48GB GPUs in 21 hours.

Paper Structure

This paper contains 37 sections, 15 equations, 5 figures, 15 tables, 2 algorithms.

Introduction
Background
Quantization Grid
Iterative Loss-error-based Quantization
Methodology
Revisiting the Loss Error
Loss-Error-Aware Network Quantization
Non-Uniform Loss-Error-Aware Grid
Loss-Error-Aware Affine Grid
LeanQuant
Experiments
Main Results
Memory and Time Efficiency
Ablation Study
...and 22 more sections

Figures (5)

Figure 1: (Left) The empirical distributions of inverse Hessian diagonals, computed on 262K tokens from the C4 dataset for the Llama-3-8B model, contain outliers that can cause high loss errors. (Right) Our proposed loss-error-aware non-uniform and affine grids better preserve the quantized precision of outliers, leading to more accurate quantized models.
Figure 2: Comparison of loss errors $\epsilon$, summed over each layer, for GPTQ and LeanQuant (affine and non-uniform) during iterative quantization.
Figure 3: Comparison of affine (left) and non-uniform (right) 2-bit quantization grids applied to the weights in the first MLP-down layer of Llama-3-8B. The affine grid uses evenly spaced quantization grid points between the minimum and maximum weights. In contrast, the non-uniform grid allows grid points to be placed flexibly, as their positions are stored in a look-up table. This enables finer quantization in dense regions and coarser quantization in sparse regions, better aligning with the weight distribution and reducing quantization error.
Figure 4: Evaluation of quantized Llama-3-8B-Instruct on MT-Bench using OpenAI GPT-4o as a judge. The win rates reported exclude ties.
Figure 5: Comparison of loss errors $\epsilon$ of each layer for GPTQ and LeanQuant (affine and non-uniform) during iterative quantization.
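
The contrast drawn in the Figure 3 caption between an evenly spaced affine grid and a look-up-table-based non-uniform grid can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the weights are synthetic (Laplace-distributed), the function names are hypothetical, and the non-uniform grid is fitted with plain 1-D k-means that minimizes ordinary squared error. It does not include the loss-error-aware weighting that the paper proposes, so it should not be read as LeanQuant's procedure.

```python
import numpy as np

def affine_grid(w, bits):
    """Min-max affine grid: 2**bits evenly spaced points from min(w) to max(w)."""
    return np.linspace(w.min(), w.max(), 2 ** bits)

def kmeans_grid(w, bits, iters=50):
    """Non-uniform grid from plain 1-D k-means (Lloyd's algorithm).

    Illustrative stand-in only: it minimizes unweighted squared error and is
    NOT the loss-error-aware grid learning described in the paper.
    """
    k = 2 ** bits
    # Initialise the grid points at evenly spaced quantiles of the weights.
    grid = np.quantile(w, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assign every weight to its nearest grid point.
        idx = np.abs(w[:, None] - grid[None, :]).argmin(axis=1)
        # Move each grid point to the mean of its assigned weights.
        for j in range(k):
            members = w[idx == j]
            if members.size:
                grid[j] = members.mean()
    return np.sort(grid)

def quantize(w, grid):
    """Return integer codes and the values read back from the look-up table."""
    codes = np.abs(w[:, None] - grid[None, :]).argmin(axis=1)
    return codes, grid[codes]

rng = np.random.default_rng(0)
# Synthetic bell-shaped "weights" (an assumption, not real model weights).
w = rng.laplace(loc=0.0, scale=0.02, size=4096)

mse = {}
for name, grid in [("affine", affine_grid(w, 2)), ("non-uniform", kmeans_grid(w, 2))]:
    codes, w_hat = quantize(w, grid)
    mse[name] = float(np.mean((w - w_hat) ** 2))
    print(f"{name:12s} grid={np.round(grid, 4)} mse={mse[name]:.6f}")
```

In this sketch the affine grid spends two of its four points at the extremes of the range, while the k-means grid concentrates its points where the weights are dense, which matches the qualitative behaviour the caption describes. Storing `grid` as a small table and indexing it with `codes` is the look-up step the caption refers to.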
