Towards Superior Quantization Accuracy: A Layer-sensitive Approach

Feng Zhang, Yanbin Liu, Weihua Li, Jie Lv, Xiaodan Wang, Quan Bai

TL;DR

This work addresses the inefficiency of uniform quantization in large language models by introducing a layer-sensitive approach. It leverages Activation Sensitivity and weight distribution Kurtosis to identify layers that are particularly susceptible to quantization error and allocates additional memory budgets to them via SensiBoost and KurtBoost. Empirical results across Llama models show up to a 9% reduction in perplexity with only a ~2% increase in memory, outperforming state-of-the-art calibration-free baselines. The approach enables more accurate post-training quantization with minimal overhead, promoting practical, scalable deployment of large transformers.

Abstract

Large Vision and Language Models have exhibited remarkable human-like intelligence in tasks such as natural language comprehension, problem-solving, logical reasoning, and knowledge retrieval. However, training and serving these models requires substantial computational resources, posing a significant barrier to their widespread application and further research. To mitigate this challenge, various model compression techniques have been developed to reduce computational requirements. Nevertheless, existing methods often employ uniform quantization configurations, failing to account for how quantization difficulty varies across the layers of large neural network models. This paper tackles this issue by leveraging layer-sensitivity features, such as activation sensitivity and weight distribution Kurtosis, to identify layers that are challenging to quantize accurately and to allocate additional memory budget to them. The proposed methods, named SensiBoost and KurtBoost, respectively, demonstrate notable improvement in quantization accuracy, achieving up to 9% lower perplexity with only a 2% increase in memory budget on Llama models compared to the baseline.
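The idea of flagging hard-to-quantize layers by weight-distribution Kurtosis and giving them extra bits can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `kurtosis` and `allocate_bits`, the top-k selection rule, the 4-bit and 5-bit budgets, and the synthetic weights are hypothetical and are not taken from the paper, which may select and boost layers differently.

```python
import numpy as np

def kurtosis(weights: np.ndarray) -> float:
    """Pearson kurtosis (fourth standardized moment); about 3.0 for a Gaussian."""
    w = np.asarray(weights, dtype=np.float64).ravel()
    centered = w - w.mean()
    var = np.mean(centered ** 2)
    return float(np.mean(centered ** 4) / var ** 2)

def allocate_bits(layer_weights, base_bits=4, boost_bits=5, top_k=1):
    """Assign boost_bits to the top_k layers with the highest kurtosis.

    The top-k rule and the bit values are illustrative assumptions,
    not the selection rule used in the paper.
    """
    scores = [kurtosis(w) for w in layer_weights]
    boosted = set(np.argsort(scores)[-top_k:].tolist())
    bits = [boost_bits if i in boosted else base_bits
            for i in range(len(layer_weights))]
    return bits, scores

# Synthetic stand-ins for per-layer weights: three near-Gaussian layers
# and one layer with a handful of large outliers (a heavy tail).
rng = np.random.default_rng(0)
layers = [rng.normal(size=4096) for _ in range(3)]
heavy = rng.normal(size=4096)
heavy[:8] = 12.0  # inject outliers so this layer has high kurtosis
layers.append(heavy)

bits, scores = allocate_bits(layers)
avg_bits = sum(bits) / len(bits)  # average budget after boosting one layer
```

In this toy setup only the outlier-heavy layer receives the larger budget, so the average bit width rises by a fraction of a bit rather than by a full bit for every layer.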

Paper Structure

This paper contains 18 sections, 17 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: This figure demonstrates the relationship between quantization methods (HQQ, RTN, BnB), datasets (WikiText2, C4, pileval, BoS), and layer-wise sensitivity. The distinct shapes of the sensitivity curves for the Llama-2-7B and Llama-3-8B models indicate that the sensitivity property is model dependent. Meanwhile, the near-identical patterns across calibration datasets and quantization methods show that layer-wise sensitivity to quantization error is independent of both. For optimal clarity, the figure is best viewed in color and with zoom.
  • Figure 2: This figure illustrates how the bit budget influences layer-wise sensitivity. The magnitude of sensitivity varies among the 3-bit, 4-bit, and 8-bit groups. The 4-bit and 8-bit groups differ more, as indicated by the wider gap between their curves. However, the overall patterns of the three bit groups closely resemble one another. For optimal clarity, the figure is best viewed in color and with zoom.
  • Figure 3: This figure presents the sensitivity patterns of the Llama-2-7B base model and its fine-tuned variants. Two fine-tuned models are included for comparison: Llama-2-7B-chat (middle) and meditron-7B (right), a medical LLM fine-tuned on a carefully curated medical corpus. As indicated by the nearly identical shapes of the sensitivity curves, the two fine-tuned models clearly inherit the sensitivity properties of the base model. For optimal clarity, the figure is best viewed in color and with zoom.
  • Figure 4: This figure illustrates the win-tie-loss performance of the SensiBoost (denoted as "SB") and KurtBoost (denoted as "KB") methods compared to their ablation test (labeled as "ABL") as well as the baseline methods HQQ and MXQ, across three Llama models. As anticipated, SensiBoost and KurtBoost outperform the baseline methods HQQ and MXQ due to the allocation of additional bit budgets. However, their relatively low win rates (53% against HQQ and 70% against MXQ in the case of SensiBoost, 66% against HQQ and 75% against MXQ for KurtBoost) on the Llama-2-13B model suggest that achieving significant improvements in larger models with a limited extra memory budget is challenging. SensiBoost consistently outperforms its ablation test variant. However, its comparison with the KurtBoost method reveals mixed outcomes: while SensiBoost underperforms on the two Llama-2 models, it demonstrates considerable advantages on the Llama-3-8B model. For optimal clarity, the figure is best viewed in color and with zoom.
  • Figure 5: This figure illustrates the perplexity performance of the SensiBoost and KurtBoost approaches evaluated on the Llama-2-13B model using the WikiText2 dataset. The green triangles, representing the SensiBoost method, are positioned closer to the y-axis, indicating that SensiBoost requires less additional memory to achieve performance comparable to KurtBoost. Notably, SensiBoost exhibits a slight advantage over KurtBoost: it attains a near-minimal perplexity score with approximately 2% additional bit budget, as emphasized in the magnified sub-plot. For optimal interpretation, the figure is best viewed in color and with zoom.
  • ...and 4 more figures
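
The layer-wise sensitivity curves described in Figures 1 and 2 depend on measuring how much quantizing a layer's weights perturbs its output. The sketch below shows one plausible way to compute such a score. It is an assumption, not the paper's definition: the relative-output-error proxy, the per-row round-to-nearest quantizer, the helper names `rtn_quantize` and `sensitivity`, and the random stand-in data are all hypothetical.

```python
import numpy as np

def rtn_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Round-to-nearest asymmetric quantization with one scale per row,
    followed by dequantization back to floating point."""
    levels = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-12) / levels
    q = np.clip(np.round((w - w_min) / scale), 0, levels)
    return q * scale + w_min

def sensitivity(x: np.ndarray, w: np.ndarray, bits: int) -> float:
    """Relative output error ||x W^T - x W_q^T||_F / ||x W^T||_F.

    This proxy is an assumption made for illustration.
    """
    ref = x @ w.T
    err = ref - x @ rtn_quantize(w, bits).T
    return float(np.linalg.norm(err) / np.linalg.norm(ref))

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))    # stand-in calibration activations
w = rng.normal(size=(128, 64))   # stand-in weights of one linear layer

# Score the same layer under the three bit budgets compared in Figure 2.
scores = {b: sensitivity(x, w, b) for b in (3, 4, 8)}
```

Running this per layer of a real model, instead of on random data, would yield one curve per bit budget. On the toy data the score shrinks as the bit width grows, which matches the change in magnitude across bit groups that the Figure 2 caption describes.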