Scaling Laws for Post Training Quantized Large Language Models

Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang

TL;DR

This work addresses the unpredictability of post-training quantization (PTQ) for large language models by identifying scaling factors that govern PTQ performance and by building a predictive model. It analyzes pre-trained loss, local loss landscape, MX data types, and GPTQ optimization to derive empirical scaling laws, and then trains a random forest to predict the post-PTQ $\mathrm{NLL}$ given a set of features. The findings generalize across multiple model families and datasets, producing a Pareto frontier that guides the trade-off between model size and quantization precision for deployment on resource-constrained devices. The work offers a practical framework to anticipate PTQ outcomes without exhaustive trial-and-error searches, enabling more principled deployment of quantized LLMs.

Abstract

Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size. In contrast to the existence of practical scaling laws governing pre-training, the quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice. In this work, we attempted to close this gap for post-training weight quantization of LLMs by conducting a systematic empirical study on multiple LLM families quantized to numerous low-precision tensor data types using popular weight quantization techniques. We identified key scaling factors pertaining to characteristics of the local loss landscape, based on which the performance of quantized LLMs can be reasonably well predicted by a statistical model.
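The local loss landscape referred to above is probed by perturbing the pre-trained weights at a controlled signal-to-noise ratio (SNR) and re-evaluating the loss. The following is a minimal sketch of that idea, not the authors' implementation: the Gaussian perturbation direction, the helper names (`perturb_at_snr`, `measure_snr_db`) and the quadratic `toy_loss` standing in for a real NLL evaluation are all assumptions made for illustration.

```python
import math
import random


def perturb_at_snr(weights, target_snr_db, rng):
    """Add a random perturbation scaled so that
    10 * log10(||w||^2 / ||noise||^2) equals target_snr_db."""
    # Assumption: an isotropic Gaussian direction; the source only says
    # "typical radial perturbations".
    noise = [rng.gauss(0.0, 1.0) for _ in weights]
    signal_power = sum(w * w for w in weights)
    noise_power = sum(n * n for n in noise)
    target_noise_power = signal_power / (10.0 ** (target_snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [w + scale * n for w, n in zip(weights, noise)]


def measure_snr_db(reference, perturbed):
    """SNR in dB of the perturbed weights relative to the reference."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((p - r) ** 2 for r, p in zip(reference, perturbed))
    return 10.0 * math.log10(signal_power / noise_power)


def toy_loss(weights):
    """Hypothetical stand-in for the NLL: a quadratic bowl whose minimum
    sits at the all-ones vector."""
    return sum((w - 1.0) ** 2 for w in weights) / len(weights)


rng = random.Random(0)
w_star = [1.0] * 1000  # stand-in for pre-trained weights at a local minimum

# Loss as a function of perturbation SNR, from mild to strong perturbation.
curve = [(snr, toy_loss(perturb_at_snr(w_star, snr, rng)))
         for snr in (30.0, 20.0, 10.0, 0.0)]
for snr, loss in curve:
    print(f"SNR {snr:5.1f} dB -> loss {loss:.4f}")
```

In a real study the `toy_loss` call would be replaced by an evaluation of the model's NLL on a validation set, repeated over several independent perturbation directions.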

Paper Structure

This paper contains 20 sections, 3 equations, 16 figures.

Figures (16)

  • Figure 1: Left: Scaling of pre-trained NLL loss. NLL losses evaluated on the validation split of the WikiText-2 dataset are plotted against the total parameter counts in the transformer layers' weight tensors. Model families are color-coded and the symbol sizes encode the weight parameter count, a convention shared by the following figures. Right: Local radial loss landscape mapping. Shown here is a measurement of the typical loss landscape in the neighborhood of the pre-trained weights, obtained by evaluating the loss along typical radial perturbations; 3 independent instances are illustrated for opt-1.3b, together with their Taylor series approximations.
  • Figure 2: Left: Local loss landscape of LLMs grouped in families. Shown are the mean (colored curves) and range (colored shades) of 3 independent measurements for each model. The typical characteristics are common to all models. Within a family, larger models tend to have a flatter local loss landscape, in a predictable manner. Right: Scaling of local loss landscape as a function of LLM size. We plot NLL loss against weight parameter count, with typical perturbation SNR as a gray-scale heat map. Thin white iso-SNR curves are at 2 dB increments. With the OPT family as the only exception, the vertical spacing of these iso-SNR curves is shorter in large models than in small ones of the same family, suggesting flatter local minima at larger model sizes.
  • Figure 3: Left: Scaling of SQNRs and NLL losses before and after PTQ, relative to the typical loss landscape. We show data from 3 members of the OPT model family, whose parameter counts are separated by 1 order of magnitude. RTN (before PTQ, hollow symbols) and GPTQ (after PTQ, filled symbols) are plotted together with the typical radial loss landscape empirically mapped. Right: Local loss landscape underlying varied effectiveness of GPTQ acting on the same model quantized at different weight precision. Shown here are data of opt-1.3b quantized to mxint6_128, mxint4_128, mxint3_128 and mxint2_128. The colored, hollow or filled diamonds represent the SQNRs and NLL losses before and after GPTQ, respectively. We further map the underlying radial loss landscape in the directions of typical random perturbation (thin gray lines), of RTN quantization (colored dashed lines) and of GPTQ quantization (colored solid lines).
  • Figure 4: Changes in SQNRs and NLL losses resulting from GPTQ for OPT family. Numerical precision is color-coded and model size encoded by symbol size. Diagonal line represents identity.
  • Figure 5: Left: A predictive model based on random forest regression. Data for 18 models from the 5 LLM families used for predictive model fitting are shown in light gray; colored symbols represent held-out test data from mpt-7b and pythia-1b, respectively. Prediction and observation are plotted against each other for direct comparison, diagonal line marking identity. Right: Prediction of NLL losses after GPTQ, for unseen LLMs. We tested our predictive model's performance on 2 held-out LLMs from unseen model families, mpt-7b and pythia-1b. Convention follows Figure 3, with additional large circular symbols representing model prediction of GPTQ losses.
  • ...and 11 more figures
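
The captions above report SQNRs for round-to-nearest (RTN) quantization to types such as mxint4_128. As a rough illustration of how such a number can be computed, the sketch below applies block-wise symmetric integer RTN with one shared power-of-two scale per block of 128 values and then measures the SQNR in dB. It is loosely modeled on MX-style formats; the exact element encoding, the scale selection rule, the function names and the synthetic Gaussian weights are assumptions, not details taken from the paper.

```python
import math
import random


def quantize_block_rtn(block, bits):
    """Symmetric integer RTN with a single power-of-two scale for the block."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return list(block)
    # Assumption: pick the smallest power-of-two scale that keeps the
    # largest magnitude within [-qmax, qmax].
    scale = 2.0 ** math.ceil(math.log2(max_abs / qmax))
    # Python's round() rounds to nearest with ties to even; the clamp
    # guards against floating-point edge cases.
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in block]


def quantize_rtn(weights, bits, block_size=128):
    """Quantize a flat list of weights block by block."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(quantize_block_rtn(weights[start:start + block_size], bits))
    return out


def sqnr_db(original, quantized):
    """Signal-to-quantization-noise ratio in dB."""
    signal = sum(x * x for x in original)
    noise = sum((x - q) ** 2 for x, q in zip(original, quantized))
    return float("inf") if noise == 0.0 else 10.0 * math.log10(signal / noise)


rng = random.Random(0)
# Synthetic stand-in for a weight tensor; real weights are not Gaussian in general.
weights = [rng.gauss(0.0, 0.02) for _ in range(1024)]

sqnr_by_bits = {bits: sqnr_db(weights, quantize_rtn(weights, bits))
                for bits in (6, 4, 3, 2)}
for bits, value in sqnr_by_bits.items():
    print(f"{bits}-bit, block 128: SQNR = {value:.2f} dB")
```

Read together with the loss-versus-SNR landscape, an SQNR computed this way indicates roughly where on the local loss curve an RTN-quantized model would land if quantization noise behaved like a typical random perturbation.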