Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs

Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Yifan Lu, Yerui Sun, Lin Ma, Yuchen Xie

TL;DR

Integer Scale addresses the inference bottleneck of fine-grained LLM quantization by replacing per-group float scales with integer scales controlled by an adaptive amplifier. The method, designed as a plug-in for existing post-training quantization pipelines, eliminates most costly data-type conversions and leverages kernel fusion to deliver substantial end-to-end speedups while preserving accuracy. It enables efficient quantization of challenging models such as Mixtral-8x7B and LLaMA-3, with end-to-end speedups of 2.13x and 2.31x, respectively, over their FP16 versions and minimal degradation. Overall, Integer Scale offers a practical, out-of-the-box improvement that expands the viable space of fast, low-bit-width quantization for real-world LLM deployment.

Abstract

We introduce Integer Scale, a novel post-training quantization scheme for large language models that effectively resolves the inference bottleneck in current fine-grained quantization approaches while maintaining similar accuracy. Integer Scale is a free lunch because it requires no extra calibration or fine-tuning, which would otherwise incur additional costs. It can be used plug-and-play with most fine-grained quantization methods. Its integration yields an end-to-end speed boost of up to 1.85x over the original counterpart with comparable accuracy. Additionally, by combining the proposed Integer Scale with fine-grained quantization, we resolve the quantization difficulty for the Mixtral-8x7B and LLaMA-3 models with negligible performance degradation, together with end-to-end speed boosts of 2.13x and 2.31x, respectively, over their FP16 versions.

Paper Structure (31 sections, 9 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: End-to-end latency of W4A8 (Integer Scale) compared with W4A8 (Float Scale) and W4A16 (Marlin) on LLaMA-2 models. The speedup ratio is written on top of the bars.
  • Figure 2: (a) Fine-grained quantization divides the activation $X$ of size $M\times K$ and the weight of size $K\times N$ into groups for separate quantization. (b) The previous float scale scheme requires numerous costly type conversions (I32toF32) of the grouped matrix multiplication results, which impedes overall performance. (c) Our proposed scheme with integer scales and automatic amplifiers (denoted as $\alpha$) alleviates the problem while retaining similar accuracy. Note that $s_{ij}$ are the scales for each weight group $g_{ij}$, and $s_{ai}$ are the scales for $X$.
  • Figure 3: Kernel latency comparison between W4A8 with Float Scale and FP16. The red line denotes the acceleration ratio of W4A8 with Float Scale over FP16.
  • Figure 4: (a) The range of amplified ($\alpha=2^{10}$) float scales of LLaMA-2-7B in the first layer (others are similar) mapped to 16-bit integers. The majority of amplified scales can be represented within 8 bits. (b) The number of bit shifts required to amplify scales per linear layer. (c) Weight MSE between integer scale and float scale under different amplifiers.
  • Figure 5: (a) The fine-grained W4A8 kernel (K=4096, N=22016) with the integer scale (W4A8 Integer Scale) is faster than its float scale counterpart (W4A8 Float Scale). The gray region denotes the "performance cliff". (b) End-to-end speed boost on Mixtral-8x7B over FP16 under various batch sizes.
  • ...and 3 more figures
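
The idea behind Figures 2 and 4 can be illustrated with a small NumPy sketch. It is not the authors' kernel: the function names, the symmetric quantizer, the int64 accumulator, and the shift heuristic (amplify until the smallest scale reaches 1) are assumptions made for illustration. It contrasts a float scale path, which converts every group's integer partial sum to floating point, with an integer scale path, which multiplies by amplified, rounded scales and converts only once at the end.

```python
import numpy as np

def quantize_weights(w, group_size, bits=4):
    """Symmetric per-group quantization along K (assumed quantizer, for illustration)."""
    K, N = w.shape
    groups = w.reshape(K // group_size, group_size, N)
    qmax = 2 ** (bits - 1) - 1
    scales = np.maximum(np.abs(groups).max(axis=1) / qmax, 1e-8)   # shape (G, N)
    q = np.clip(np.round(groups / scales[:, None, :]), -qmax - 1, qmax)
    return q.astype(np.int32), scales

def quantize_activations(x, bits=8):
    """Symmetric per-row (per-token) quantization of the activation X."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.maximum(np.abs(x).max(axis=1, keepdims=True) / qmax, 1e-8)  # (M, 1)
    q = np.clip(np.round(x / scales), -qmax - 1, qmax)
    return q.astype(np.int32), scales

def matmul_float_scale(xq, xs, wq, ws):
    """Float scale path: one integer-to-float conversion per weight group."""
    G, g, N = wq.shape
    acc = np.zeros((xq.shape[0], N), dtype=np.float64)
    for i in range(G):
        part = xq[:, i * g:(i + 1) * g] @ wq[i]        # integer partial sum
        acc += part.astype(np.float64) * ws[i]         # conversion inside the loop
    return acc * xs

def matmul_integer_scale(xq, xs, wq, ws, alpha=2 ** 10):
    """Integer scale path: amplified integer scales, one conversion at the end."""
    G, g, N = wq.shape
    int_scales = np.round(ws * alpha).astype(np.int64)  # amplified, rounded scales
    # int64 keeps this sketch free of overflow; it says nothing about real kernels.
    acc = np.zeros((xq.shape[0], N), dtype=np.int64)
    for i in range(G):
        part = (xq[:, i * g:(i + 1) * g] @ wq[i]).astype(np.int64)
        acc += part * int_scales[i]                     # stays in integer arithmetic
    return acc.astype(np.float64) * xs / alpha          # single conversion, undo alpha

def amplifier_shift(scales, max_shift=16):
    """Assumed heuristic: smallest n such that min(scales) * 2**n >= 1."""
    s_min = float(np.min(scales))
    n = 0
    while s_min * (1 << n) < 1.0 and n < max_shift:
        n += 1
    return n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 256)).astype(np.float32)
    w = rng.standard_normal((256, 8)).astype(np.float32)
    xq, xs = quantize_activations(x)
    wq, ws = quantize_weights(w, group_size=128)
    y_float = matmul_float_scale(xq, xs, wq, ws)
    y_int = matmul_integer_scale(xq, xs, wq, ws)
    rel_err = np.linalg.norm(y_float - y_int) / np.linalg.norm(y_float)
    print("relative difference:", rel_err, "shift:", amplifier_shift(ws))
```

Because the only difference between the two paths is the rounding of the amplified scales, a larger amplifier shrinks the gap between them at the cost of wider integer scales, which is the trade-off that panels (a) and (c) of Figure 4 describe.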