AdpQ: A Zero-shot Calibration Free Adaptive Post Training Quantization Method for LLMs
Alireza Ghaffari, Sharareh Younesian, Vahid Partovi Nia, Boxing Chen, Masoud Asgharian
TL;DR
The paper addresses the need for calibration-free post-training quantization (PTQ) of large language models (LLMs) by proposing AdpQ, a zero-shot method that uses Adaptive LASSO to identify and separate outlier weights, applying adaptive soft-thresholding together with mixed-precision quantization. The authors show theoretically that this approach minimizes the Kullback-Leibler (KL) divergence between the original and quantized weight distributions and preserves information via Jensen–Shannon centroids when outliers are separated. Empirically, AdpQ achieves state-of-the-art accuracy at 3- to 4-bit quantization with an order-of-magnitude speedup over existing PTQ methods (at least 10x faster than AWQ and up to 100x faster than SpQR) while preserving privacy by avoiding calibration data. The method is hardware-friendly, easy to implement, and broadly applicable across LLM families, offering practical benefits for efficient deployment without sacrificing performance.
Abstract
The ever-growing computational complexity of Large Language Models (LLMs) necessitates efficient deployment strategies. Current state-of-the-art approaches to Post-training Quantization (PTQ) often require calibration to achieve the desired accuracy. This paper presents AdpQ, a novel zero-shot adaptive PTQ method for LLMs that achieves state-of-the-art performance in low-precision quantization (e.g., 3-bit) without requiring any calibration data. Inspired by the Adaptive LASSO regression model, our proposed approach tackles the challenge of outlier activations by separating salient weights using an adaptive soft-thresholding method. Guided by Adaptive LASSO, this method ensures that the distribution of the quantized weights closely follows that of the originally trained weights and eliminates the need for calibration data entirely, setting our method apart from popular approaches such as SpQR and AWQ. Furthermore, our method offers an additional privacy benefit, since it requires no calibration or training data. We also delve deeper into the information-theoretic underpinnings of the proposed method. We demonstrate that it leverages the Adaptive LASSO to minimize the Kullback-Leibler divergence between the quantized weights and the originally trained weights. This minimization ensures that the quantized model retains the Shannon information content of the original model to a great extent, enabling efficient deployment without sacrificing accuracy or information. Our method matches the accuracy of existing methods on various LLM benchmarks while reducing quantization time by at least 10x, solidifying our contribution to efficient and privacy-preserving LLM deployment.
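The idea of separating salient weights by adaptive soft-thresholding can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the exact objective, so the code assumes the textbook one-dimensional Adaptive LASSO problem, minimizing 0.5*(w - x)^2 + lam*|x|/|w| over x, whose closed-form solution is sign(w)*max(|w| - lam/|w|, 0). Weights that survive the threshold are treated as outliers and quantized separately from the rest. The names `outlier_mask`, `quantize_uniform`, `quantize_with_outlier_split`, and the parameters `outlier_frac` and `outlier_bits` are hypothetical, and min-max uniform quantization stands in for whatever quantizer the paper uses. Plain Python lists keep the example dependency-free.

```python
import random


def adaptive_soft_threshold(w, lam):
    """Minimizer over x of 0.5*(w - x)**2 + lam*abs(x)/abs(w) (assumed objective)."""
    if w == 0.0:
        return 0.0
    shrunk = max(abs(w) - lam / abs(w), 0.0)
    return shrunk if w > 0 else -shrunk


def outlier_mask(weights, outlier_frac):
    """Mark as outliers the weights that survive adaptive soft-thresholding.

    lam is chosen (hypothetical heuristic) so that roughly `outlier_frac`
    of the weights survive: |w| - lam/|w| > 0 is equivalent to |w| > sqrt(lam).
    """
    mags = sorted(abs(w) for w in weights)
    idx = min(int(len(mags) * (1.0 - outlier_frac)), len(mags) - 1)
    lam = mags[idx] ** 2
    return [adaptive_soft_threshold(w, lam) != 0.0 for w in weights]


def quantize_uniform(values, bits):
    """Min-max uniform quantization followed by dequantization."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_with_outlier_split(weights, bits=3, outlier_bits=3, outlier_frac=0.05):
    """Quantize outliers and non-outliers separately, each with its own range."""
    mask = outlier_mask(weights, outlier_frac)
    core = iter(quantize_uniform([w for w, m in zip(weights, mask) if not m], bits))
    outl = iter(quantize_uniform([w for w, m in zip(weights, mask) if m], outlier_bits))
    # Merge the two quantized groups back into the original order.
    return [next(outl) if m else next(core) for m in mask]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic weights: a narrow Gaussian bulk plus two injected large outliers.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(1000)]
weights[0], weights[1] = 0.5, -0.5

plain = quantize_uniform(weights, bits=3)
split = quantize_with_outlier_split(weights, bits=3, outlier_frac=0.05)
print("plain 3-bit MSE:", mse(weights, plain))
print("split 3-bit MSE:", mse(weights, split))
```

In this toy setting the two injected outliers stretch the range of the plain quantizer, so the bulk of the weights shares very few levels. Separating the outliers gives the bulk its own narrow range at the same bit width. No input data is needed at any point, which is the sense in which such a scheme is calibration-free.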
