Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs

Qingyuan Li; Ran Meng; Yiduo Li; Bo Zhang; Yifan Lu; Yerui Sun; Lin Ma; Yuchen Xie

TL;DR

Integer Scale targets the inference bottleneck of fine-grained LLM quantization by replacing per-group float scales with integer scales controlled by an adaptive amplifier. Designed as a plug-in for existing post-training quantization pipelines, it eliminates most of the costly data-type conversions and leverages kernel fusion to deliver substantial end-to-end speedups while preserving accuracy. It also enables efficient quantization of challenging models such as Mixtral-8x7B and LLaMA-3, with end-to-end speedups of 2.13x and 2.31x over their FP16 versions, respectively, and negligible degradation. Overall, Integer Scale is a practical, out-of-the-box improvement that widens the viable space of fast, low-bit-width quantization for real-world LLM deployment.

Abstract

We introduce Integer Scale, a novel post-training quantization scheme for large language models that effectively resolves the inference bottleneck in current fine-grained quantization approaches while maintaining similar accuracy. Integer Scale is a free lunch: it requires no extra calibration or fine-tuning, which would otherwise incur additional costs. It can be used plug-and-play with most fine-grained quantization methods. Its integration yields up to a 1.85x end-to-end speed boost over the original counterpart with comparable accuracy. Additionally, by combining the proposed Integer Scale with fine-grained quantization, we resolve the quantization difficulty for Mixtral-8x7B and LLaMA-3 models with negligible performance degradation, with end-to-end speed boosts of 2.13x and 2.31x over their FP16 versions, respectively.

Paper Structure (31 sections, 9 equations, 8 figures, 8 tables)

Introduction
Related Work
LLM Serving Frameworks and Optimization Techniques
LLM Quantization Algorithms
Motivation
Fine Granularity Strengthens Current Quantization Approaches
Fine-grained Quantization Suffers from the Inference Bottleneck
Method
Integer Scale with Adaptive Scale Amplifier
Kernel Implementation
Experiments
Setup
Experiment Result on LAMBADA, C4, and WikiText-2
Experiment Result on Common Sense QA
W4A8 Kernel Latency Comparison
...and 16 more sections

Figures (8)

Figure 1: End-to-end latency of W4A8 (Integer Scale) compared with W4A8 (Float Scale) and W4A16 (Marlin) on LLaMA-2 models. The speedup ratio is written on top of the bars.
Figure 2: (a) Fine-grained quantization divides the activation $X$ of size $M\times K$ and the weight of size $K\times N$ into groups for separate quantization. (b) The previous float scale scheme requires numerous costly type conversions (I32toF32) from grouped matrix multiplication results, which impedes the overall performance. (c) Our proposed scheme with integer scales and automatic amplifiers (denoted as $\alpha$) alleviates the problem while retaining similar accuracy. Note $s_{ij}$ are the scales for each weight group $g_{ij}$, and $s_{ai}$ are the scales for $X$.
Figure 3: Kernel latency comparison between W4A8 w/ Float Scale vs. FP16. The red line denotes its acceleration ratios over FP16.
Figure 4: (a) The range of amplified ($\alpha=2^{10}$) float scales of LLaMA-2-7B in the first layer (others are similar) mapped to 16-bit integers. The majority of amplified scales can be represented within 8 bits. (b) The number of bit shifts required to amplify scales per linear layer. (c) Weight MSE between integer scale and float scale under different amplifiers.
Figure 5: (a) Fine-grained W4A8 kernel (K=4096, N=22016) with the integer scale (W4A8 Integer Scale) boosts its float scale counterpart (W4A8 Float Scale). The gray region denotes the "performance cliff". (b) End-to-end speed boost on Mixtral 8x7B over FP16 under various batch sizes.
...and 3 more figures
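
The contrast drawn in the Figure 2 caption between float scales (b) and integer scales (c) can be sketched in NumPy. This is only an illustrative sketch under stated assumptions, not the paper's kernel. The helper names (`quantize_weight_groups`, `quantize_activations`, `matmul_float_scale`, `matmul_integer_scale`) are hypothetical. The activation scales $s_{ai}$ are assumed to be per row of $X$. The amplifier is fixed at $\alpha=2^{10}$ (the value quoted in the Figure 4 caption) rather than chosen adaptively. The integer path accumulates in int64 purely to keep the sketch safe from overflow.

```python
import numpy as np

def quantize_weight_groups(w, group_size, bits=4):
    """Symmetric per-group quantization of a K x N weight along K (assumed layout)."""
    K, N = w.shape
    g = w.reshape(K // group_size, group_size, N)        # (G, group_size, N)
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(g).max(axis=1) / qmax, 1e-8)  # (G, N) float scales
    q = np.clip(np.round(g / scale[:, None, :]), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def quantize_activations(x, bits=8):
    """Symmetric per-row quantization of an M x K activation (an assumption)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(x).max(axis=1, keepdims=True) / qmax, 1e-8)  # (M, 1)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def matmul_float_scale(xq, xs, wq, ws):
    """Float scale scheme: every group's int32 partial sum is converted to float."""
    G, gs, N = wq.shape
    M = xq.shape[0]
    xg = xq.reshape(M, G, gs).astype(np.int32)
    acc = np.zeros((M, N), dtype=np.float32)
    for g in range(G):
        part = xg[:, g, :] @ wq[g].astype(np.int32)      # int32 partial sum
        acc += part.astype(np.float32) * ws[g]           # one I32->F32 conversion per group
    return acc * xs

def matmul_integer_scale(xq, xs, wq, ws, alpha=2 ** 10):
    """Integer scale scheme: scales are amplified by alpha and rounded to integers,
    so the per-group accumulation stays in integer arithmetic."""
    G, gs, N = wq.shape
    M = xq.shape[0]
    ws_int = np.round(ws * alpha).astype(np.int64)       # amplified integer scales
    xg = xq.reshape(M, G, gs).astype(np.int32)
    acc = np.zeros((M, N), dtype=np.int64)               # int64 only for safety here
    for g in range(G):
        part = xg[:, g, :] @ wq[g].astype(np.int32)
        acc += part.astype(np.int64) * ws_int[g]         # integer multiply-accumulate
    return acc.astype(np.float32) * xs / alpha           # single conversion at the end

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256)).astype(np.float32)
w = rng.standard_normal((256, 64)).astype(np.float32)

wq, ws = quantize_weight_groups(w, group_size=128)
xq, xs = quantize_activations(x)

y_float = matmul_float_scale(xq, xs, wq, ws)
y_int = matmul_integer_scale(xq, xs, wq, ws)
rel_err = float(np.linalg.norm(y_int - y_float) / np.linalg.norm(y_float))
print("relative difference between integer-scale and float-scale outputs:", rel_err)
```

In this toy setting the only difference between the two paths is the rounding of `ws * alpha`, so the outputs agree closely whenever the amplified scales are well above 1. A real kernel would also have to bound the accumulator width, which this sketch sidesteps by using int64.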
