CPTQuant - A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models

Amitash Nanda, Sree Bhargavi Balija, Debashis Sahoo

TL;DR

CPTQuant tackles the heavy memory and compute demands of large language models by presenting three mixed-precision post-training quantization strategies—CMPQ, PMPQ, and TDMPQ—that allocate per-layer precision based on layer sensitivity. The framework formalizes a quantization objective balancing accuracy and quantization loss and demonstrates hardware-friendly, training-free implementations across BERT and OPT models, achieving up to 4x compression and 2x efficiency with minimal accuracy degradation. PMPQ excels in overall compression, while TDMPQ offers substantial gains for language modeling tasks, and CMPQ provides robust performance across model types. The work presents a practical pipeline for deploying large transformers on resource-constrained hardware, with insights into model-specific suitability and directions for extending to even larger models and post-quantization fine-tuning.

Abstract

Large language models have transformed natural language understanding and generation, but they come with substantial memory and computational requirements. Quantization techniques have emerged as a promising avenue for addressing these challenges while preserving accuracy and improving energy efficiency. We propose CPTQuant, a comprehensive strategy that introduces correlation-based (CMPQ), pruning-based (PMPQ), and Taylor decomposition-based (TDMPQ) mixed precision techniques. CMPQ adapts the precision level based on canonical correlation analysis of different layers. PMPQ optimizes precision layer-wise based on each layer's sensitivity to sparsity. TDMPQ modifies precision using Taylor decomposition to assess each layer's sensitivity to input perturbation. These strategies allocate higher precision to more sensitive layers and lower precision to robust layers. We evaluate CPTQuant on BERT, OPT-125M, OPT-350M, OPT-1.3B, and OPT-2.7B. We demonstrate up to 4x compression and a 2x increase in efficiency with minimal accuracy drop compared to Hugging Face FP16. PMPQ stands out for achieving considerably higher model compression. Sensitivity analyses across various LLMs show that the initial and final 30% of layers exhibit higher sensitivities than the remaining layers. PMPQ demonstrates an 11% higher compression ratio than other methods for classification tasks, while TDMPQ achieves a 30% greater compression ratio for language modeling tasks.
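
The idea of giving more bits to sensitive layers and fewer to robust ones can be illustrated with a minimal sketch. This is not the paper's CMPQ, PMPQ, or TDMPQ procedure: the rank-based bucketing rule, the function names `allocate_bits` and `compression_ratio`, and the candidate bit widths (4, 8, 16) are all assumptions made for illustration. The sensitivity scores are taken as given, however they were computed.

```python
from typing import List, Sequence


def allocate_bits(sensitivities: Sequence[float],
                  bit_choices: Sequence[int] = (4, 8, 16)) -> List[int]:
    """Assign a bit width per layer: higher sensitivity gets more bits.

    Hypothetical rule (not from the paper): rank the layers by sensitivity
    and split the ranking into equally sized buckets, one per bit width.
    """
    n = len(sensitivities)
    if n == 0:
        return []
    levels = sorted(bit_choices)
    # Layer indices ordered from least to most sensitive.
    order = sorted(range(n), key=lambda i: sensitivities[i])
    bits = [0] * n
    for rank, layer in enumerate(order):
        bucket = min(rank * len(levels) // n, len(levels) - 1)
        bits[layer] = levels[bucket]
    return bits


def compression_ratio(param_counts: Sequence[int],
                      bits: Sequence[int],
                      baseline_bits: int = 16) -> float:
    """Size of the baseline (FP16 by default) divided by the quantized size."""
    baseline = sum(p * baseline_bits for p in param_counts)
    quantized = sum(p * b for p, b in zip(param_counts, bits))
    return baseline / quantized
```

With made-up scores such as `[0.9, 0.2, 0.1, 0.3, 0.4, 0.8]`, where the first and last layers are the most sensitive, the sketch keeps those two layers at 16 bits and pushes the middle layers down to 4 or 8 bits. The ratio helper also shows where an upper bound of 4x relative to FP16 comes from: storing every weight in 4 bits instead of 16 gives exactly 16 / 4 = 4.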

Paper Structure

This paper contains 20 sections, 14 equations, 9 figures, 2 tables, 3 algorithms.

Figures (9)

  • Figure 1: Visualization of comparison of LLMs: parameters and GPU requirements increase by 10x.
  • Figure 2: Layerwise sensitivities distribution using the CMPQ method.
  • Figure 3: Layerwise sensitivities distribution using the PMPQ method.
  • Figure 4: Layerwise sensitivities distribution using the TDMPQ method.
  • Figure 5: Comparison of accuracy drop of different types of BERT models using CMPQ, PMPQ, TDMPQ with FP16.
  • ...and 4 more figures