When Quantization Affects Confidence of Large Language Models?
Irina Proskurina, Luc Brun, Guillaume Metzler, Julien Velcin
TL;DR
This work investigates how post-training quantization to 4-bit weights using GPTQ affects the confidence and calibration of large language models. It analyzes predictive probability distributions across multiple models and datasets by comparing compressed and full-precision variants via calibration error, confidence metrics, and distribution distances like Jensen-Shannon divergence. The results indicate that quantization can degrade calibration and shift confidence for both correct and incorrect predictions, with larger effects on uncertain samples and partial mitigation as model size grows. These findings motivate calibration-aware quantization pipelines and cross-family benchmarking to better understand and mitigate quantization-induced confidence loss.
Abstract
Recent studies have introduced effective compression techniques for Large Language Models (LLMs) based on post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing work indicates that quantization can compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering language model type and scale as factors that contribute to quantization loss. First, we show that 4-bit quantization with GPTQ decreases confidence in the true labels, with the size of the effect varying across language models. Second, we observe that the impact on confidence fluctuates across model scales. Finally, we propose an explanation for quantization loss based on confidence levels: quantization disproportionately affects samples on which the full model already exhibited low confidence.
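
The quantities named above (calibration error, confidence in the true label, and Jensen-Shannon divergence between the predictive distributions of the full and quantized models) can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' evaluation code: the function names, the ten equal-width confidence bins, and the base-2 logarithm are all choices made here for the example.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins (bin count is an assumption).

    probs:  (n_samples, n_classes) predictive probabilities
    labels: (n_samples,) integer class labels
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)                         # confidence of the prediction
    correct = (probs.argmax(axis=1) == labels).astype(float)
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_ids = np.digitize(conf, inner_edges, right=True)  # values in 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # weight of the bin times |accuracy - mean confidence| in the bin
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece


def true_label_confidence(probs, labels):
    """Probability assigned to the true label, one value per sample."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    return probs[np.arange(len(labels)), labels]


def jensen_shannon_divergence(p, q, eps=1e-12):
    """Per-sample JSD in bits between two sets of distributions (last axis)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # eps keeps the log finite; terms with a == 0 contribute zero
        return np.sum(a * np.log2((a + eps) / (b + eps)), axis=-1)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    # Toy probabilities standing in for full-precision and quantized outputs.
    full = np.array([[0.95, 0.05], [0.95, 0.05], [0.65, 0.35], [0.65, 0.35]])
    quant = np.array([[0.90, 0.10], [0.92, 0.08], [0.45, 0.55], [0.60, 0.40]])
    labels = np.array([0, 0, 0, 1])
    print("ECE full :", expected_calibration_error(full, labels))
    print("ECE quant:", expected_calibration_error(quant, labels))
    shift = true_label_confidence(quant, labels) - true_label_confidence(full, labels)
    print("mean true-label confidence shift:", shift.mean())
    print("mean JSD:", jensen_shannon_divergence(full, quant).mean())
```

Comparing the per-sample confidence shift against the full model's original confidence is one way to check whether low-confidence samples are the ones most affected by quantization.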
