When Quantization Affects Confidence of Large Language Models?

Irina Proskurina, Luc Brun, Guillaume Metzler, Julien Velcin

TL;DR

This work investigates how post-training quantization to 4-bit weights using GPTQ affects the confidence and calibration of large language models. It analyzes predictive probability distributions across multiple models and datasets by comparing compressed and full-precision variants via calibration error, confidence metrics, and distribution distances like Jensen-Shannon divergence. The results indicate that quantization can degrade calibration and shift confidence for both correct and incorrect predictions, with larger effects on uncertain samples and partial mitigation as model size grows. These findings motivate calibration-aware quantization pipelines and cross-family benchmarking to better understand and mitigate quantization-induced confidence loss.

Abstract

Recent studies introduced effective compression techniques for Large Language Models (LLMs) via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing works have indicated that quantization might compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering factors such as language model type and scale as contributors to quantization loss. Firstly, we reveal that quantization with GPTQ to 4-bit results in a decrease in confidence regarding true labels, with varying impacts observed among different language models. Secondly, we observe fluctuations in the impact on confidence across different scales. Finally, we propose an explanation for quantization loss based on confidence levels, indicating that quantization disproportionately affects samples where the full model exhibited low confidence levels in the first place.
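
The two quantities the abstract revolves around, calibration error and confidence in the true label, can be illustrated with a small sketch. This is not the authors' code: it assumes the common equal-width-bin form of expected calibration error (ECE), and the function names and toy probabilities below are hypothetical, made up only to show the computation.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin-size-weighted mean of |accuracy - confidence| over equal-width bins.

    Assumption: confidence is the probability of the predicted (argmax) class.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the bin's sample share
    return ece

def true_label_confidence(probs, labels):
    """Probability each sample assigns to its gold label."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    return probs[np.arange(len(labels)), labels]

# Hypothetical class probabilities for four binary samples (illustration only).
labels = np.array([0, 1, 1, 0])
probs_full = np.array([[0.90, 0.10], [0.20, 0.80], [0.40, 0.60], [0.55, 0.45]])
probs_quant = np.array([[0.85, 0.15], [0.30, 0.70], [0.55, 0.45], [0.50, 0.50]])

ece_full = expected_calibration_error(probs_full, labels)
ece_quant = expected_calibration_error(probs_quant, labels)
mean_shift = (true_label_confidence(probs_quant, labels)
              - true_label_confidence(probs_full, labels)).mean()
print(f"ECE full={ece_full:.4f}  ECE quantized={ece_quant:.4f}  "
      f"mean true-label confidence shift={mean_shift:+.4f}")
```

A negative mean shift on real model outputs would correspond to the decrease in true-label confidence that the abstract reports; the toy numbers here carry no empirical meaning.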

Paper Structure

This paper contains 22 sections, 5 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Quantization-induced absolute confidence shifts in original (pre-compression) low and high confidence samples (BLOOM and OPT models, HellaSwag benchmark). The bin with the largest mean confidence shift is highlighted.
  • Figure 2: Mean Jensen-Shannon distances between full and quantized LLMs across benchmarks. The distances depict dissimilarities in true-class probability distributions.
  • Figure 3: Confidence difference for models across datasets. Columns correspond to datasets and rows to models. Each bar shows the mean difference in confidence between the quantized and full models, with the full model's confidence on the horizontal axis. Note that some ranges start at $0.5$ for binary tasks and at $0.25$ for multi-class tasks with four classes: a class whose confidence falls below these values cannot be the predicted class.
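
The quantities plotted in the figures above can be sketched in a few lines. The captions do not say how the distributions were estimated, so this sketch assumes that true-class probabilities are histogrammed over [0, 1] before the Jensen-Shannon distance is taken (base-2 logarithm, so the distance lies in [0, 1]). The helper names, bin counts, and toy confidences are hypothetical and not taken from the paper.

```python
import numpy as np

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence, log base 2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    # Clamp at zero to guard against tiny negative rounding errors.
    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))

def confidence_histogram(conf, n_bins=10):
    """Counts of confidences in equal-width bins over [0, 1]."""
    counts, _ = np.histogram(conf, bins=n_bins, range=(0.0, 1.0))
    return counts

def binned_confidence_shift(conf_full, conf_quant, n_bins=3, lo=0.0):
    """Mean (quantized - full) confidence per bin of full-model confidence."""
    conf_full = np.asarray(conf_full, dtype=float)
    shift = np.asarray(conf_quant, dtype=float) - conf_full
    edges = np.linspace(lo, 1.0, n_bins + 1)
    # Inner edges map each sample to a bin index in 0..n_bins-1.
    idx = np.clip(np.digitize(conf_full, edges[1:-1]), 0, n_bins - 1)
    means = np.array([shift[idx == b].mean() if (idx == b).any() else np.nan
                      for b in range(n_bins)])
    return edges, means

# Hypothetical true-class confidences for a four-class task (illustration only).
conf_full = np.array([0.30, 0.35, 0.60, 0.90, 0.95])
conf_quant = np.array([0.20, 0.30, 0.58, 0.90, 0.94])

dist = js_distance(confidence_histogram(conf_full),
                   confidence_histogram(conf_quant))
edges, mean_shift_per_bin = binned_confidence_shift(
    conf_full, conf_quant, n_bins=3, lo=0.25)  # range starts at 1/4 for 4 classes
largest_bin = int(np.nanargmax(np.abs(mean_shift_per_bin)))
print("JS distance:", dist)
print("bin edges:", edges, "mean shift per bin:", mean_shift_per_bin)
```

The `lo=0.25` argument mirrors the caption's note that ranges start at $0.25$ for four-class tasks. `largest_bin` picks the bin with the largest absolute mean shift, which is the bin Figure 1 highlights.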