LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid
Tianyi Zhang, Anshumali Shrivastava
TL;DR
LeanQuant tackles the high memory and compute demands of large language model quantization by learning loss-error-aware grids that preserve precision for weights associated with outliers in the inverse Hessian diagonals. By combining non-uniform and affine grid learning with an efficient fused kernel, it aligns quantization with practical inference kernels and scales to models as large as 405B parameters on modest hardware, while maintaining or improving zero-shot accuracy and perplexity compared with established baselines. The approach generalizes across common quantization formats and includes an exact OBQ-compatible variant, demonstrating substantial improvements in quality and efficiency for very large LLMs. This work enhances accessibility of open-source LLMs by enabling accurate post-training quantization on standard GPUs, reducing hardware requirements and deployment barriers.
Abstract
Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising technique to reduce memory requirements and decoding latency. However, recent accurate quantization methods often depend on specialized computations or custom data formats to achieve better model quality. This limits their compatibility with popular frameworks, because they require dedicated inference kernels tailored to specific hardware and software platforms, which hinders wider adoption. Furthermore, many competitive methods have high resource requirements and computational overhead for quantizing models, making it challenging to scale them to hundreds of billions of parameters. In response to these challenges, we propose LeanQuant (Loss-Error-Aware Network Quantization), a novel quantization method that is accurate, versatile, and scalable. In the existing popular iterative loss-error-based quantization framework, we identify a critical limitation in prior methods: the min-max affine quantization grid fails to preserve model quality due to outliers in inverse Hessian diagonals. To overcome this fundamental issue, we propose learning loss-error-aware grids instead of using non-adaptive min-max affine grids. Our approach not only produces quantized models that are more accurate but also generalizes to a wider range of quantization types, including affine and non-uniform quantization, enhancing compatibility with more frameworks. Extensive experiments with recent LLMs demonstrate that LeanQuant is highly accurate, comparing favorably against competitive baselines in model quality, and scalable, achieving very accurate quantization of Llama-3.1 405B, one of the largest open-source LLMs to date, using two Quadro RTX 8000-48GB GPUs in 21 hours.
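To make the contrast between a min-max affine grid and a loss-error-aware grid concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the per-weight loss error used in OBS/OBQ-style frameworks, (w - q)^2 / [H^-1]_ii, and, as an illustrative choice not stated in the abstract, learns a non-uniform grid with weighted 1-D k-means using 1 / [H^-1]_ii as sample weights. All function names and the `hinv_diag` argument are hypothetical.

```python
import numpy as np


def minmax_affine_grid(w, bits):
    """Non-adaptive baseline: 2**bits evenly spaced points over [min(w), max(w)]."""
    return np.linspace(w.min(), w.max(), 2 ** bits)


def quantize_to_grid(w, grid):
    """Round every weight to its nearest grid point."""
    idx = np.abs(w[:, None] - grid[None, :]).argmin(axis=1)
    return grid[idx]


def loss_error(w, w_q, hinv_diag):
    """Sum of OBS/OBQ-style per-weight loss errors (w - q)^2 / [H^-1]_ii."""
    return float(np.sum((w - w_q) ** 2 / hinv_diag))


def learn_weighted_grid(w, hinv_diag, bits, iters=25):
    """Weighted 1-D k-means (Lloyd's algorithm), started from the min-max grid.

    Assumption for illustration: sample weights are 1 / hinv_diag, so weights
    whose quantization error costs more loss pull grid points toward them.
    """
    sample_w = 1.0 / hinv_diag
    grid = minmax_affine_grid(w, bits)
    for _ in range(iters):
        # Assignment step: nearest grid point for each weight.
        idx = np.abs(w[:, None] - grid[None, :]).argmin(axis=1)
        # Update step: move each grid point to the weighted mean of its members.
        for k in range(grid.size):
            mask = idx == k
            if mask.any():
                grid[k] = np.average(w[mask], weights=sample_w[mask])
    return np.sort(grid)
```

Because Lloyd's algorithm never increases its weighted objective and the sketch starts from the min-max grid, the learned grid's total loss error under this model is no worse than the baseline's. Calling `loss_error(w, quantize_to_grid(w, grid), hinv_diag)` with each grid on the same weights shows the gap.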
