Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort

TL;DR

The paper investigates why transformer activations develop outliers that hinder activation quantization. It shows that attention heads learn no-update or partial-update patterns, driven by softmax dynamics interacting with residual connections and LayerNorm. To address this, it introduces two architectural changes, clipped softmax and gated attention, that allow exact zeros or controlled updates, reducing outliers while preserving or improving floating-point performance. In experiments on BERT, OPT, and ViT, the authors demonstrate improved 8-bit post-training quantization without extensive retraining, and show that the approach scales to larger models. These methods offer a practical path to more quantization-friendly transformers and energy-efficient inference, especially for edge deployment.
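To make concrete why outliers hinder activation quantization, the following minimal sketch compares the round-trip error of 8-bit quantization with and without a single large value. It assumes simple asymmetric min-max quantization, and the function names and synthetic activations are hypothetical illustrations, not the paper's quantization setup or data.

```python
def quantize_dequantize(values, num_bits=8):
    """Uniform asymmetric min-max quantization, then dequantization (illustrative)."""
    levels = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Map each value to the nearest grid point, then back to floating point.
    return [round((v - lo) / scale) * scale + lo for v in values]


def mean_abs_error(values, num_bits=8):
    """Average absolute round-trip error introduced by quantization."""
    restored = quantize_dequantize(values, num_bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Synthetic activations (an assumption): 1000 values spread over roughly [-1, 1].
typical = [((i * 37) % 1000) / 500.0 - 1.0 for i in range(1000)]
# The same activations plus one strong outlier, which stretches the grid.
with_outlier = typical + [60.0]

err_typical = mean_abs_error(typical)
err_outlier = mean_abs_error(with_outlier)
print(f"mean abs error without outlier: {err_typical:.5f}")
print(f"mean abs error with outlier:    {err_outlier:.5f}")
```

Because the quantization grid must cover the full range, one outlier makes the step size much coarser for all the ordinary values, which is the effect the paper's methods aim to avoid.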

Abstract

Transformer models have been widely adopted in various domains over the last years, and especially large language models have advanced the field of AI significantly. Due to their size, the capability of these networks has increased tremendously, but this has come at the cost of a significant increase in necessary compute. Quantization is one of the most effective ways to reduce the computational time and memory consumption of neural networks. Many studies have shown, however, that modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize. To retain acceptable performance, the existence of these outliers requires activations to be in higher bitwidth or the use of different numeric formats, extra fine-tuning, or other workarounds. We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual. To achieve the exact zeros needed in the attention matrix for a no-update, the input to the softmax is pushed to be larger and larger during training, causing outliers in other parts of the network. Based on these observations, we propose two simple (independent) modifications to the attention mechanism - clipped softmax and gated attention. We empirically show that models pre-trained using our methods learn significantly smaller outliers while maintaining and sometimes even improving the floating-point task performance. This enables us to quantize transformers to full INT8 quantization of the activations without any additional effort. We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers.
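The two modifications named in the abstract can be illustrated with a small sketch. It assumes that clipped softmax stretches the softmax output from (0, 1) to (gamma, zeta) and clips it back to [0, 1], and that gated attention scales a head's output by a sigmoid gate. The parameter values, the function names, and the scalar `gate_logit` (standing in for a learned gating function) are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_softmax(scores, zeta=1.0, gamma=-0.03):
    """Stretch softmax to (gamma, zeta), then clip to [0, 1].

    Assumed form: clip((zeta - gamma) * softmax(x) + gamma, 0, 1).
    The default zeta and gamma are illustrative values only.
    """
    return [min(1.0, max(0.0, (zeta - gamma) * p + gamma)) for p in softmax(scores)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_head_output(attn_output, gate_logit):
    """Scale a head's output by a sigmoid gate in (0, 1).

    gate_logit is a hypothetical precomputed scalar; in a real model it
    would come from a small learned function of the input.
    """
    g = sigmoid(gate_logit)
    return [g * v for v in attn_output]


scores = [4.0, 0.0, 0.0, 0.0]      # moderate logits, no extreme values needed
plain = softmax(scores)            # small but strictly positive probabilities
clipped = clipped_softmax(scores)  # small probabilities become exact zeros
nearly_off = gated_head_output([1.0, -2.0], gate_logit=-20.0)
```

With plain softmax, the small probabilities only approach zero as the gap between logits grows, which is the pressure toward large softmax inputs described above. With the clipped variant, moderate logits already produce exact zeros; note that the clipped probabilities no longer need to sum to one. The gate gives a head a separate way to produce a near-zero update without touching the attention probabilities.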

Paper Structure (49 sections, 4 equations, 17 figures, 11 tables)

Figures (17)

  • Figure 1: Histograms of outlier counts vs. token positions (blue) and hidden dimensions (green), recorded from the MNLI-m validation set on BERT-base. We use zero-based indexing for dimensions.
  • Figure 2: Visualization of the patterns in the self-attention, specifically the attention probabilities, values, and their product (left, middle and right columns, respectively), in attention head #3 for BERT-base, computed on several data sequences from the MNLI-m validation set.
  • Figure 3: A summary of our outlier analysis for ViT demonstrated on a random image from ImageNet validation set. (a) An input image. (b) Outliers in the output of layer #11. (c) Cumulative attention weight spent on every patch (matrix of attention probabilities summed over rows) in the attention head #1, layer #12. (d) A corresponding matrix of attention probabilities. (e) An average magnitude of values for outlier and non-outlier patches.
  • Figure 4: A schematic illustration of the attention layer in BERT. The hidden activation tensor is denoted by $\mathbf{x}$. $\oplus$ is an element-wise addition. The problematic FFN output that generates the largest-magnitude outliers is highlighted in red. Notice how those outliers in the previous layer influence the behavior of the attention mechanism in the next layer.
  • Figure 5: A schematic illustration of our proposed gated attention.
  • ...and 12 more figures