
OutlierTune: Efficient Channel-Wise Quantization for Large Language Models

Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao

TL;DR

OutlierTune is easy to implement and hardware-efficient, adding almost no computational overhead at inference time. It brings Int6 quantization of instruction-tuned LLMs, such as OPT-IML, to the same level as half precision (FP16).

Abstract

Quantizing the activations of large language models (LLMs) is challenging because of structured outliers. Most existing methods quantize activations per token or per tensor, which makes it difficult to achieve both accuracy and hardware efficiency. To address this problem, we propose OutlierTune, an efficient per-channel post-training quantization (PTQ) method for LLM activations. OutlierTune consists of two components: pre-execution of dequantization and symmetrization. Pre-execution of dequantization updates the model weights with the activation scaling factors, avoiding the internal scaling and the costly computational overhead that per-channel activation quantization otherwise requires. Symmetrization further reduces the quantization differences arising from the weight updates by balancing the numerical ranges across activation channels. OutlierTune is easy to implement and hardware-efficient, adding almost no computational overhead at inference time. Extensive experiments show that the proposed framework outperforms existing methods across multiple tasks. The framework also generalizes better: it brings Int6 quantization of instruction-tuned LLMs, such as OPT-IML, to the same level as half precision (FP16). Moreover, the proposed framework is 1.48x faster than the FP16 implementation while reducing memory usage by approximately 2x.
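
The idea behind pre-executing the dequantization can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `quantize_per_channel`, the tensor shapes, and the synthetic data are assumptions made for illustration, the scales are computed from the input itself rather than from calibration data, and weight quantization is omitted. The sketch only shows that multiplying each weight row by the matching activation scale, once and offline, gives the same result as rescaling the activations inside the matrix multiplication.

```python
import numpy as np

def quantize_per_channel(x, n_bits=8):
    """Symmetric per-channel quantization of activations x, shape (tokens, channels).

    Hypothetical helper: returns integer-valued codes and one scale per channel.
    """
    qmax = 2 ** (n_bits - 1) - 1
    scales = np.abs(x).max(axis=0) / qmax            # one scale per channel
    scales = np.where(scales == 0, 1.0, scales)      # guard against all-zero channels
    x_q = np.clip(np.round(x / scales), -qmax, qmax)
    return x_q, scales

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # 4 tokens, 8 input channels (synthetic)
x[:, 3] *= 50.0                  # one structured outlier channel
w = rng.normal(size=(8, 5))      # weight matrix, 8 inputs -> 5 outputs

x_q, s = quantize_per_channel(x)

# Internal scaling: activations are rescaled channel by channel before the matmul.
y_internal = (x_q * s) @ w

# Pre-executed dequantization: scale the weight rows once, then use the raw codes.
w_folded = s[:, None] * w
y_folded = x_q @ w_folded

y_ref = x @ w                    # full-precision reference
print("max |internal - folded| :", np.abs(y_internal - y_folded).max())
print("max |folded - reference|:", np.abs(y_folded - y_ref).max())
```

Because `(x_q * s) @ w` and `x_q @ (s[:, None] * w)` are algebraically identical, the per-channel scaling disappears from the inference path; the remaining gap to `y_ref` is the rounding error of the activation quantization alone.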

Paper Structure

This paper contains 30 sections, 8 equations, 4 figures, and 10 tables.

Figures (4)

  • Figure 1: Differences between quantization dimensions. The tensor-tensor and token-channel settings in Fig. 1a and 1b support the efficient external scaling operation but result in high quantization loss. The channel-channel quantization in Fig. 1c causes only minimal quantization loss ($< 0.1$) but lacks computational efficiency. We pre-execute the dequantization operation of the channel-channel quantization through weight updating to ensure computational efficiency, as shown in Fig. 1d. Perplexity on WikiText2 is measured with OPT-6.7B under INT8 quantization.
  • Figure 2: Comparison of the quantization flows of multi-head attention (MHA) and the feed-forward network (FFN) under different quantization settings. OutlierTune avoids the additional computational overhead caused by the internal scaling required for per-channel quantization (right), thereby achieving efficiency similar to that of per-tensor or per-token quantization (left).
  • Figure 3: Real latency and memory footprint of our proposed OutlierTune and Baseline. The batch size and sequence length are set to 1 and 512, respectively.
  • Figure 4: Real latency of OutlierTune over different batch sizes on OPT Models. The sequence length is set to 128.
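
The contrast between quantization granularities described in the Figure 1 caption can be reproduced qualitatively on synthetic data. The following is a minimal sketch, not the paper's experiment: the helper `fake_quant`, the matrix size, and the two outlier channels are assumptions chosen for illustration, and only the activation side is quantized. It measures the round-trip error of symmetric 8-bit quantization when one scale is shared by the whole tensor, by each token (row), or by each channel (column).

```python
import numpy as np

def fake_quant(x, axis=None, n_bits=8):
    """Symmetric quantize-then-dequantize (hypothetical helper).

    axis=None -> one scale for the whole tensor (per-tensor)
    axis=1    -> one scale per row, i.e. per token
    axis=0    -> one scale per column, i.e. per channel
    """
    qmax = 2 ** (n_bits - 1) - 1
    absmax = np.abs(x).max(axis=axis, keepdims=True)
    scale = np.where(absmax == 0, 1.0, absmax / qmax)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 64))   # 128 tokens, 64 channels (synthetic)
x[:, [5, 17]] *= 100.0           # two structured outlier channels

granularities = {"per-tensor": None, "per-token": 1, "per-channel": 0}
mse = {
    name: float(np.mean((fake_quant(x, axis) - x) ** 2))
    for name, axis in granularities.items()
}
for name, err in mse.items():
    print(f"{name:12s} MSE = {err:.5f}")
```

With outliers confined to fixed channels, every token contains an outlier, so per-token and per-tensor scales are both dominated by it; per-channel scales isolate the outlier channels and leave the remaining channels with fine resolution.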