Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs

Jaewoo Yang, Hayun Kim, Younghoon Kim

TL;DR

The paper identifies activation spikes in GLU-based FFNs as a key obstacle to activation quantization in PTQ for modern LLMs. It proposes two calibration-based, model-agnostic solutions, QFeM and QFeP, to isolate and neutralize spikes without retraining. Across multiple GLU implementations (e.g., LLaMA-2/3, Mistral, Mixtral, SOLAR, Gemma), these methods restore performance toward FP16 and improve compatibility with coarse-grained quantization, often outperforming existing outlier mitigation alone. The work offers a practical path to faster, memory-efficient INT8 inference for GLU-heavy LLMs and demonstrates tangible latency/memory benefits with minimal overhead.

Abstract

Modern large language models (LLMs) have established state-of-the-art performance through architectural improvements, but still require significant computational cost for inference. In an effort to reduce the inference cost, post-training quantization (PTQ) has become a popular approach, quantizing weights and activations to lower precision, such as INT8. In this paper, we reveal the challenges of activation quantization in GLU variants, which are widely used in the feed-forward network (FFN) of modern LLMs, such as the LLaMA family. The problem is that severe local quantization errors, caused by excessive activation magnitudes in GLU variants, significantly degrade the performance of the quantized LLM. We denote these activations as activation spikes. Our further observations reveal a systematic pattern of activation spikes: 1) the activation spikes occur in the FFN of specific layers, particularly in the early and late layers; 2) the activation spikes are dedicated to a couple of tokens, rather than being shared across a sequence. Based on our observations, we propose two empirical methods, Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), to isolate the activation spikes during quantization. Our extensive experiments validate the effectiveness of the proposed methods for activation quantization, especially with a coarse-grained scheme, on the latest LLMs with GLU variants, including LLaMA-2/3, Mistral, Mixtral, SOLAR, and Gemma. In particular, our methods enhance current alleviation techniques (e.g., SmoothQuant) that fail to control the activation spikes. Code is available at https://github.com/onnoo/activation-spikes.
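
To make the abstract's central claim concrete, the following minimal sketch shows how a single activation spike can wipe out the other tokens under coarse-grained (per-tensor) INT8 quantization. It is an illustration under stated assumptions, not the authors' code: the activation values are toy numbers, and the helper names (`int8_scale`, `quantize_dequantize`, `mean_abs_error`) are hypothetical.

```python
def int8_scale(values):
    """Symmetric scale that maps the largest magnitude onto the INT8 limit 127."""
    return max(abs(v) for v in values) / 127.0


def quantize_dequantize(values, scale):
    """Fake-quantize: round to the INT8 grid, clamp to [-127, 127], map back."""
    restored = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        restored.append(q * scale)
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between the original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Toy activations (assumed values): each row is a token, each column a channel.
normal_tokens = [[0.5, -1.2, 0.8, 0.3], [-0.7, 0.9, -0.4, 1.1]]
spike_token = [0.6, 900.0, -0.2, 0.4]  # one token carries an activation spike
activations = normal_tokens + [spike_token]

# Per-tensor: a single scale for all tokens, which the spike determines.
flat = [v for row in activations for v in row]
per_tensor_scale = int8_scale(flat)
per_tensor_err = [
    mean_abs_error(row, quantize_dequantize(row, per_tensor_scale))
    for row in normal_tokens
]

# Per-token: each token gets its own scale, so the spike stays isolated.
per_token_err = [
    mean_abs_error(row, quantize_dequantize(row, int8_scale(row)))
    for row in normal_tokens
]

print("per-tensor error on normal tokens:", per_tensor_err)
print("per-token error on normal tokens: ", per_token_err)
```

With the per-tensor scale, every value of the two ordinary tokens rounds to zero, so their information is lost entirely, while the per-token scale keeps the error small. This is the kind of local error that the proposed methods aim to avoid without giving up coarse-grained quantization.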

Paper Structure (40 sections, 13 figures, 9 tables)

This paper contains 40 sections, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Calibration results on GLU-implemented and non-GLU-implemented LLMs. We present the maximum magnitudes of the input activations for each linear module and of the layer-wise hidden states. For more results on different LLMs, see Appendix A.2 and A.3.
  • Figure 2: Token-wise scales in a specific layer with an activation spike. When quantizing the input activations using a per-tensor scale, the scale of the activation spike dominates the scales of the other tokens. For more examples, see Appendix D.2.
  • Figure 3: Overview of QFeM and QFeP. (Left): QFeM excludes the modules whose $r^{(m)}$ is larger than the hyperparameter $\alpha$ from quantization. (Right): QFeP computes in advance the prefix of activation spikes and utilizes solely their KV cache during the quantization phase, effectively preventing further activation spikes in subsequent sequences.
  • Figure 4: Trade-off between perplexity (a proxy for performance) and $|M_{unq}|$ (a proxy for latency) as a function of the threshold $\alpha$ for the LLaMA-2-13B model.
  • Figure 5: The average accuracy of zero-shot evaluation on other GLU-implemented LLMs. Most models recover significantly compared to W8A8, with performance close to FP16.
  • ...and 8 more figures
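
The module selection described in the captions of Figures 3 and 4 can be sketched as follows. This is a hedged illustration, not the authors' implementation: the excerpt does not define how $r^{(m)}$ is computed, so the sketch takes precomputed scores as input, and the module names, score values, and the function `select_unquantized_modules` are all hypothetical.

```python
def select_unquantized_modules(ratios, alpha):
    """Return the modules whose score r^(m) is larger than alpha.

    These modules form the set M_unq that is excluded from quantization.
    The result is sorted by name for reproducible output.
    """
    return sorted(name for name, r in ratios.items() if r > alpha)


# Hypothetical per-module scores r^(m) gathered during calibration.
# How r^(m) is defined is not stated in this excerpt; the values are made up.
ratios = {
    "layers.0.mlp.down_proj": 1.4,
    "layers.1.mlp.down_proj": 37.0,
    "layers.15.mlp.down_proj": 1.1,
    "layers.30.mlp.down_proj": 12.5,
}

# A lower threshold leaves more modules unquantized (larger |M_unq|),
# a higher threshold leaves fewer.
for alpha in (2.0, 20.0):
    m_unq = select_unquantized_modules(ratios, alpha)
    print(f"alpha={alpha}: |M_unq|={len(m_unq)} -> {m_unq}")
```

Sweeping $\alpha$ in this way gives one value of $|M_{unq}|$ per threshold, which is the quantity Figure 4 plots against perplexity.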