TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies

Guang Liang, Jie Shao, Ningyuan Tang, Xinyao Liu, Jianxin Wu

TL;DR

This work argues that the extreme activation outliers that destabilize FP8 Transformer training are a data-independent, mechanical artifact of weight structure and training dynamics rather than a property of the input data. It introduces TWEO, a simple loss term that penalizes large activation magnitudes across Transformer blocks. With TWEO, FP8 pre-training is stable for both vision and language models, and residual activations can be quantized per-tensor for PTQ. No complex engineering tricks or architectural changes are required, and the abstract reports a 36% increase in training throughput. The findings suggest broad applicability to hardware-efficient training and quantization, with potential impact on accelerator design and low-bit AI deployment.

Abstract

Native FP8 support in modern hardware is essential for training large Transformers, but is severely hindered by extreme activation outliers. Existing solutions either rely on complex mixed-precision engineering or invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a data-independent, mechanically-produced artifact of training, originating from specific structural properties of the weight matrices (i.e., colinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20. TWEO then enables full-model FP8 pre-training with neither engineering tricks nor architectural changes for both LLM and ViT. When standard FP8 training catastrophically collapses, TWEO achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. Also, TWEO enables a new quantization paradigm. Hardware-friendly W8A8 per-tensor static quantization of LLMs, previously considered completely unusable due to outliers, achieves SOTA performance for the first time on TWEO-trained models.
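
To illustrate why a single extreme outlier makes per-tensor quantization unusable, the following is a minimal sketch and not the paper's method. It assumes symmetric int8 quantization with one scale per tensor, and for simplicity it derives that scale from the tensor itself rather than from a separate calibration pass as a static scheme would. The function names, the Gaussian "typical" activations, and the outlier values 10000 and 20 (chosen to echo the magnitudes quoted in the abstract) are all illustrative assumptions.

```python
import random


def quantize_per_tensor_int8(values):
    """Symmetric per-tensor int8 quantization (illustrative helper).

    A single scale is shared by every element, so the largest magnitude
    in the tensor determines the resolution available to all the others.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    dequantized = [q * scale for q in quantized]
    return dequantized, scale


def mean_abs_error(original, reconstructed):
    """Average absolute reconstruction error over paired elements."""
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)


rng = random.Random(0)
# Hypothetical "typical" activations: unit-variance Gaussian noise.
typical = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Same activations plus one outlier of each size.
with_extreme = typical + [10000.0]
with_mild = typical + [20.0]

deq_extreme, scale_extreme = quantize_per_tensor_int8(with_extreme)
deq_mild, scale_mild = quantize_per_tensor_int8(with_mild)

# Error measured only on the typical activations, not on the outlier.
err_extreme = mean_abs_error(typical, deq_extreme[:1000])
err_mild = mean_abs_error(typical, deq_mild[:1000])

print(f"scale with 10000 outlier: {scale_extreme:.3f}, error: {err_extreme:.4f}")
print(f"scale with 20 outlier:    {scale_mild:.3f}, error: {err_mild:.4f}")
```

With the 10000 outlier the quantization step is roughly 78.7, so every unit-scale activation rounds to zero. With the outlier capped at 20 the step is roughly 0.157 and the typical activations keep most of their information.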

Paper Structure

This paper contains 13 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Empirical evidence for the data-dependent vs. data-independent assumptions. (a) Running real data through a pre-trained Qwen2.5-0.5B LLM produces extreme outliers ($>1000$) in most layers. (b) Running the same real data through a randomly initialized Qwen2.5-0.5B yields only small activation magnitudes ($<10$). (c) Replacing the input with random Gaussian noise, the same pre-trained Qwen2.5-0.5B still produces extreme outliers.
  • Figure 2: Activation magnitudes in GPT-2 Medium BF16 training.
  • Figure 3: Activation magnitudes in GPT-2 Medium FP8 training.
  • Figure 4: Activation value distributions for various GPT-2 model sizes, with and without TWEO.
  • Figure 5: Comparison of activation magnitudes during GPT-2 Large (774M) training. Top 1 refers to the largest activation value in each layer. Top 2, Top 3 refer to the second and third largest. Median is the median of all activations.
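
The per-layer statistics named in the Figure 5 caption (Top 1, Top 2, Top 3, Median) can be sketched as follows. This is a hedged illustration only: it assumes that "largest" means largest absolute value and that the median is also taken over absolute values, neither of which the caption states explicitly. The function name and the sample values are hypothetical.

```python
import heapq
import statistics


def layer_activation_summary(activations):
    """Summarize one layer's activations by magnitude (illustrative helper).

    Assumes at least three activation values and ranks them by absolute
    value, which is an assumption about how the caption defines "largest".
    """
    magnitudes = [abs(a) for a in activations]
    top1, top2, top3 = heapq.nlargest(3, magnitudes)
    return {
        "top1": top1,
        "top2": top2,
        "top3": top3,
        "median": statistics.median(magnitudes),
    }


# Hypothetical activations for a single layer, with one extreme outlier.
layer = [0.5, -1200.0, 0.1, 3.0, -0.2, 40.0, 0.3]
summary = layer_activation_summary(layer)
print(summary)
```

A large gap between Top 1 and the median, as in this made-up layer, is the pattern that indicates a few extreme outliers rather than a uniformly wide distribution.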