Rethinking Post-Training Quantization: Introducing a Statistical Pre-Calibration Approach

Alireza Ghaffari, Sharareh Younesian, Boxing Chen, Vahid Partovi Nia, Masoud Asgharian

TL;DR

This work reframes post-training quantization (PTQ) by introducing a weight-adaptive pre-calibration step that minimizes the KL divergence between the original and quantized weight distributions, with the aim of preserving the Shannon information of the trained weights. By modeling weight classification with Adaptive LASSO and employing pseudo activations, the method gives a computationally efficient pre-calibration step that needs no calibration data and serves as a precursor to traditional PTQ calibration. Empirical results on various LLMs show competitive accuracy and significantly faster quantization times, with perplexity and zero-shot task performance approaching or matching calibration-based PTQ methods. The approach offers a practical, information-theoretic alternative intended to improve robustness across deployment environments, and it can be combined with existing PTQ pipelines for further gains.

Abstract

As Large Language Models (LLMs) become increasingly computationally complex, developing efficient deployment strategies, such as quantization, becomes crucial. State-of-the-art Post-training Quantization (PTQ) techniques often rely on calibration processes to maintain the accuracy of these models. However, while these calibration techniques can enhance performance in certain domains, they may not be as effective in others. This paper aims to draw attention to robust statistical approaches that can mitigate such issues. We propose a weight-adaptive PTQ method that can be considered a precursor to calibration-based PTQ methods, guiding the quantization process to preserve the distribution of weights by minimizing the Kullback-Leibler divergence between the quantized weights and the originally trained weights. This minimization ensures that the quantized model retains the Shannon information content of the original model to a great extent, guaranteeing robust and efficient deployment across many tasks. As such, our proposed approach can perform on par with most common calibration-based PTQ methods, establishing a new pre-calibration step for further adjusting the quantized weights with calibration. We show that our pre-calibration results achieve the same accuracy as some existing calibration-based PTQ methods on various LLMs.
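
The objective named in the abstract, keeping the quantized weight distribution close to the original one in Kullback-Leibler divergence, can be illustrated with a small measurement sketch. This is not the paper's method: it assumes generic symmetric round-to-nearest quantization, a histogram estimate of both distributions, and synthetic Gaussian weights. All function names (`quantize_rtn`, `histogram`, `kl_divergence`, `weight_kl`) are hypothetical and introduced only for illustration.

```python
import math
import random


def quantize_rtn(weights, num_bits):
    """Symmetric round-to-nearest quantization (generic baseline, an assumption here)."""
    qmax = max(2 ** (num_bits - 1) - 1, 1)
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    # Round to the nearest integer level, clamp, then map back to real values.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def histogram(values, lo, hi, num_bins):
    """Count values into num_bins equal-width bins covering [lo, hi]."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, num_bins - 1))] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) in nats from two count vectors, with additive smoothing eps."""
    n = len(p_counts)
    p_total = sum(p_counts) + eps * n
    q_total = sum(q_counts) + eps * n
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


def weight_kl(weights, num_bits, num_bins=32):
    """Histogram-based KL between original weights (P) and their quantized copy (Q)."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return 0.0
    quantized = quantize_rtn(weights, num_bits)
    p = histogram(weights, -max_abs, max_abs, num_bins)
    q = histogram(quantized, -max_abs, max_abs, num_bins)
    return kl_divergence(p, q)


if __name__ == "__main__":
    random.seed(0)
    # Synthetic stand-in for one layer's weights (assumption, not real model data).
    layer = [random.gauss(0.0, 0.02) for _ in range(5000)]
    for bits in (2, 4, 8):
        print(bits, "bits -> KL:", weight_kl(layer, bits))
```

The histogram estimate depends on the bin count and the smoothing constant, so the absolute values are only indicative; the sketch is useful for comparing bit widths or quantizers on the same weights, not for reproducing the paper's numbers.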

Paper Structure (19 sections, 13 equations, 6 tables, 1 algorithm)