Scaling Laws for Post-Training Quantized Large Language Models
Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang
TL;DR
This work addresses the unpredictability of post-training quantization (PTQ) for large language models by identifying the scaling factors that govern PTQ performance and building a predictive model from them. It analyzes pre-trained loss, the local loss landscape, MX data types, and GPTQ optimization to derive empirical scaling laws, then trains a random forest to predict the post-PTQ negative log-likelihood (NLL) from these features. The findings generalize across multiple model families and datasets and yield a Pareto frontier that guides the trade-off between model size and quantization precision when deploying on resource-constrained devices. The result is a practical framework for anticipating PTQ outcomes without exhaustive trial-and-error search, enabling more principled deployment of quantized LLMs.
Abstract
Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size. In contrast to the existence of practical scaling laws governing pre-training, the quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice. In this work, we attempted to close this gap for post-training weight quantization of LLMs by conducting a systematic empirical study on multiple LLM families quantized to numerous low-precision tensor data types using popular weight quantization techniques. We identified key scaling factors pertaining to characteristics of the local loss landscape, based on which the performance of quantized LLMs can be reasonably well predicted by a statistical model.
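As a rough illustration of what quantizing weights to a low-precision tensor data type involves, the sketch below applies round-to-nearest quantization to blocks of values that share a single power-of-two scale, then measures how the reconstruction error grows as the bit width shrinks. It is a simplified stand-in, not the paper's procedure: the function names, the block size of 32, the signed-integer element format, and the synthetic Gaussian weights are all assumptions made for this example, and it does not reproduce the exact MX encodings or GPTQ.

```python
import math
import random


def quantize_block(block, bits):
    """Round-to-nearest quantization of one block with a shared power-of-two scale.

    Simplified illustration only: elements are signed integers of width `bits`.
    """
    qmax = 2 ** (bits - 1) - 1  # largest signed integer level
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return list(block)
    # Smallest power-of-two scale such that max_abs / scale <= qmax.
    scale = 2.0 ** math.ceil(math.log2(max_abs / qmax))
    out = []
    for x in block:
        q = round(x / scale)
        q = max(-qmax, min(qmax, q))  # clamp as a safety net
        out.append(q * scale)         # dequantize back to a float
    return out


def quantize(weights, bits, block_size=32):
    """Quantize a flat list of weights block by block (block_size is an assumption)."""
    out = []
    for i in range(0, len(weights), block_size):
        out.extend(quantize_block(weights[i:i + block_size], bits))
    return out


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for a weight tensor; not data from the paper.
random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(1024)]
errors = {bits: mse(w, quantize(w, bits)) for bits in (8, 6, 4)}
```

Here `errors` maps each bit width to the mean squared reconstruction error on the synthetic weights. Weight-space error of this kind is only a proxy; the study itself concerns how such perturbations affect the loss of the quantized model.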
