I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models

Xing Hu, Yuan Cheng, Dawei Yang, Zhihang Yuan, Jiangyong Yu, Chen Xu, Sifan Zhou

TL;DR

This work tackles fully integer-only inference for large language models by addressing the activation fluctuations that hamper quantization of both linear and non-linear operations. It introduces three core components: Fully-Smooth Block-Reconstruction (FSBR), which smooths activations across channels; Dynamic Integer-only MatMul (DI-MatMul), which handles inter-token variation through dynamic integer quantization; and a set of lightweight dynamic non-linear operators (DI-ClippedSoftmax, DI-Exp, DI-Norm, and DI-SwiGLU). Evaluated on the LLaMA and OPT families, the framework achieves accuracy comparable to FP baselines and outperforms non-integer PTQ methods, even at low bit-widths such as W4A4. This integer-only PTQ approach allows large language models to be deployed on edge devices without floating-point capabilities, offers potential latency and power savings on suitable hardware, and sets the stage for further hardware-focused optimizations.
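The channel-smoothing idea can be illustrated with a small NumPy sketch. This is not FSBR itself, whose procedure is not described in this summary; it is a generic SmoothQuant-style rescaling of a single linear layer, shown only to make "smoothing activations across channels" concrete. The function name `smooth_channels`, the exponent `alpha`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def smooth_channels(x, w, alpha=0.5):
    """Generic channel smoothing sketch (hypothetical helper, not the paper's FSBR).

    x: (tokens, channels) activations, w: (channels, out_features) weights.
    Returns (x / s, w * s[:, None], s), so the layer output x @ w is unchanged.
    """
    act_max = np.abs(x).max(axis=0)   # per-channel activation range
    w_max = np.abs(w).max(axis=1)     # per-channel weight range
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    s = np.maximum(s, 1e-5)           # guard against division by zero
    return x / s, w * s[:, None], s

# Toy data with one outlier channel (assumed values, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
x[:, 3] *= 50.0
w = rng.normal(size=(16, 4))

xs, ws, s = smooth_channels(x, w)
print("max |x| before:", np.abs(x).max())
print("max |x| after: ", np.abs(xs).max())
print("output preserved:", np.allclose(xs @ ws, x @ w))
```

The rescaling moves part of the outlier channel's magnitude from the activations into the weights, which narrows the activation range a quantizer has to cover while leaving the product mathematically unchanged.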

Abstract

Post-training quantization (PTQ) is a potent technique for accelerating the inference of large language models (LLMs). Nonetheless, existing works still require a considerable number of floating-point (FP) operations during inference, including additional quantization and de-quantization steps as well as non-linear operators such as RMSNorm and Softmax. This limitation hinders the deployment of LLMs on edge and cloud devices. In this paper, we identify that the primary obstacle to integer-only quantization for LLMs lies in the large fluctuation of activations across channels and tokens in both linear and non-linear operations. To address this issue, we propose I-LLM, a novel integer-only fully-quantized PTQ framework tailored for LLMs. Specifically, (1) we develop Fully-Smooth Block-Reconstruction (FSBR) to aggressively smooth inter-channel variations of all activations and weights. (2) To alleviate the degradation caused by inter-token variations, we introduce Dynamic Integer-only MatMul (DI-MatMul), which enables dynamic quantization in full-integer matrix multiplication by dynamically quantizing the inputs and outputs with integer-only operations. (3) We design DI-ClippedSoftmax, DI-Exp, and DI-Normalization, which use bit shifts to execute non-linear operators efficiently while maintaining accuracy. Experiments show that I-LLM achieves accuracy comparable to the FP baseline and outperforms non-integer quantization methods. For example, I-LLM can operate at W4A4 with negligible loss of accuracy. To our knowledge, we are the first to bridge the gap between integer-only quantization and LLMs. We've published our code on anonymous.4open.science, aiming to contribute to the advancement of this field.
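To make "dynamically quantizing the inputs and outputs with integer-only operations" more concrete, here is a simplified pure-Python sketch. It is not the authors' DI-MatMul algorithm, whose details are not given in this abstract. It assumes symmetric quantization and that every scale is stored as a dyadic number m / 2**k with integer m and k, and it uses integer division where a real kernel would likely prefer shifts. All names (`int_matmul`, `requant_row`, `di_matmul_sketch`) and all numeric values are hypothetical.

```python
def int_matmul(xq, wq):
    """Plain integer matrix product on lists of ints."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*wq)]
            for row in xq]

def requant_row(acc_row, m_in, k_in, n_bits=8):
    """Requantize one token's accumulators using integer arithmetic only.

    Input real values are acc * m_in / 2**k_in.
    Output real values are approximately q * m_out / 2**k_out.
    """
    qmax = (1 << (n_bits - 1)) - 1
    amax = max(1, max(abs(a) for a in acc_row))
    # floor(a * qmax / amax + 1/2), computed with integer division only.
    q_row = [(2 * a * qmax + amax) // (2 * amax) for a in acc_row]
    # New scale = amax * m_in / (qmax * 2**k_in), re-expressed as a dyadic
    # number whose mantissa is normalised to roughly 8 bits.
    num = amax * m_in
    e = 8 - (num // qmax).bit_length()
    up, down = (e, 0) if e >= 0 else (0, -e)
    denom = qmax << down
    m_out = (2 * (num << up) + denom) // (2 * denom)
    return q_row, m_out, k_in + e

def di_matmul_sketch(xq, x_scales, wq, w_scale):
    """Per-token dynamic requantization of an integer matmul (sketch)."""
    m_w, k_w = w_scale
    acc = int_matmul(xq, wq)
    return [requant_row(row, m_x * m_w, k_x + k_w)
            for row, (m_x, k_x) in zip(acc, x_scales)]

# Toy example: two tokens with very different scales (assumed values).
xq = [[12, -7, 3], [100, 90, -120]]
x_scales = [(200, 12), (160, 8)]       # per-token (m, k), scale = m / 2**k
wq = [[5, -3], [2, 7], [-6, 1]]
w_scale = (180, 10)

out = di_matmul_sketch(xq, x_scales, wq, w_scale)
for q_row, m_out, k_out in out:
    print(q_row, "scale =", m_out, "/ 2**", k_out)
```

Because each token gets its own output scale, a token with large accumulators does not force a coarse quantization step onto a token with small ones, which is the inter-token issue the abstract refers to.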

Paper Structure

This paper contains 18 sections, 14 equations, 6 figures, 5 tables, 4 algorithms.

Figures (6)

  • Figure 1: Differences in activations of the non-linear operator in LLaMA2-7b across the channel/token dimensions.
  • Figure 2: The output activation distribution of the gated unit in the SwiGLU before FSBR (a) and after FSBR (b).
  • Figure 3: Typical LLM quantization vs. I-LLM. The former requires dequantization and involves FP arithmetic, while the latter performs the entire inference using integer-only arithmetic.
  • Figure 4: PPL$\downarrow$ of different PTQ methods on the LLaMA family using W8A8. Because the PPL of I-BERT is exceptionally high, it is plotted on a dedicated y-axis.
  • Figure 5: Details of I-LLM in a transformer block. The left side of the figure illustrates various paradigms for channel-wise smoothing during the FSBR process. The right side depicts the integer-only execution pipeline for both linear operators, such as matrix multiplication (MatMul), and non-linear operators.
  • ...and 1 more figure