Efficient Post-training Quantization with FP8 Formats

Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang

TL;DR

This work demonstrates that post-training quantization with FP8 formats (E5M2, E4M3, E3M4) substantially improves workload coverage over INT8 while maintaining accuracy across 75 networks and a broad range of tasks. The authors develop a unified FP8 quantization workflow with standard and extended schemes, per-channel weight scaling, per-tensor activation scaling, and optional BatchNorm calibration; it achieves higher workload coverage (92.64% for FP8 vs. 65.87% for INT8) and favorable accuracy across NLP and CV domains. The study finds that E4M3 is particularly effective for NLP while E3M4 often excels on CV tasks, and it demonstrates benefits from mixed FP8 formats and extended operator coverage, including improved generation quality for image and text tasks. The authors provide publicly available tooling and outline future work to extend FP8 recipes to more LLMs and other domains, with the goal of practical, high-accuracy low-precision deployment.

Abstract

Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor: https://github.com/intel/neural-compressor.
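
The trade-off between dynamic range and precision across the three FP8 representations can be made concrete with a short sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes that E5M2 reserves its top exponent code for infinities and NaNs in the IEEE 754 style, and that E4M3 and E3M4 reserve only the single all-ones bit pattern for NaN. These encoding conventions, the `FP8Format` class, and its field names are not taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FP8Format:
    """Hypothetical descriptor of an 8-bit float with 1 sign bit."""
    name: str
    exp_bits: int
    man_bits: int
    ieee_specials: bool  # ASSUMPTION: True = top exponent code reserved for inf/NaN

    @property
    def bias(self) -> int:
        # Conventional exponent bias for a field of exp_bits bits.
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_normal(self) -> float:
        top = 2 ** self.exp_bits - 1
        if self.ieee_specials:
            # Largest usable exponent code is top - 1; mantissa may be all ones.
            exponent = top - 1 - self.bias
            significand = 2.0 - 2.0 ** -self.man_bits
        else:
            # Top exponent code is usable, but the all-ones mantissa is NaN.
            exponent = top - self.bias
            significand = 2.0 - 2.0 * 2.0 ** -self.man_bits
        return significand * 2.0 ** exponent

    @property
    def min_subnormal(self) -> float:
        # Smallest positive value: one mantissa step at the minimum exponent.
        return 2.0 ** (1 - self.bias - self.man_bits)

    @property
    def relative_step(self) -> float:
        # Spacing between neighbouring values relative to a power of two.
        return 2.0 ** -self.man_bits


FORMATS = [
    FP8Format("E5M2", 5, 2, ieee_specials=True),
    FP8Format("E4M3", 4, 3, ieee_specials=False),
    FP8Format("E3M4", 3, 4, ieee_specials=False),
]

if __name__ == "__main__":
    for f in FORMATS:
        print(f"{f.name}: bias={f.bias} max={f.max_normal} "
              f"min_subnormal={f.min_subnormal} step={f.relative_step}")
```

Under these assumptions, each exponent bit moved to the mantissa halves the relative step size while shrinking the largest representable magnitude, which is the range-versus-precision trade-off the study evaluates.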

Paper Structure (19 sections, 4 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: (left) Histogram of the tensor $X \sim \mathcal{N}(\mu=0.0,\,\sigma^{2}=0.5)$, which contains a small number (1%) of outliers uniformly distributed between -6.0 and 6.0. (center) Distribution of quantized values represented by the E5M2, E4M3, E3M4, and INT8 data formats. (right) Overall quantization error as measured by mean squared error (MSE).
  • Figure 2: Standard Quantization Scheme: default configuration for a broad set of operations across different workloads. Extended Quantization Scheme: configuration for additional operator coverage (e.g., LayerNorm, BatchNorm, and element-wise operations), mixed FP8 formats, and dynamic quantization. BatchNorm Calibration: recalibration of the mean and variance parameters to recover accuracy lost due to quantization. Range Calibration: max scaling and outlier clipping (more discussion in Appendix A.1).
  • Figure 3: Tensor Distributions: (left) activations in NLP workloads contain outliers, hence they are range-bounded; (center) activations in CV workloads tend to be precision-bounded; (right) weight tensors from both CV and NLP networks tend to be precision-bounded.
  • Figure 4: Variability in accuracy loss: INT8 shows higher variability for CV models than E4M3 and E3M4 due to its ineffectiveness on models such as EfficientNet, MobileNetV3, and ViT. Quantization-aware training may partially mitigate this issue, but it is outside the scope of this paper. E4M3 and E3M4 show better accuracy and less variability, with very few outliers compared to INT8.
  • Figure 5: Accuracy Loss by Size on CV (top) and NLP (bottom). The model size is represented by the ball size on a scale of $\log_{10}(model\_size)$, where tiny/small/medium/large is defined by the size range in MB: $\le 32$, $(32, 384]$, $(384, 512]$, and $> 512$, respectively. Note that some points overlap because of similar accuracy (e.g., E4M3 in blue and E3M4 in green on NLP models).
  • ...and 7 more figures
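
The experiment described in the Figure 1 caption can be approximated with a minimal sketch. It draws a tensor from $\mathcal{N}(0, 0.5)$, replaces roughly 1% of the entries with outliers uniform in $[-6, 6]$, quantizes with max scaling, and reports the MSE per format. Several parts are assumptions made for illustration and are not taken from this page: the encoding conventions (E5M2 reserves its top exponent code for inf/NaN; E4M3 and E3M4 reserve only the all-ones pattern for NaN), round-to-nearest with ties resolved toward the smaller magnitude, the symmetric INT8 range $[-127, 127]$, and every function name.

```python
import bisect
import math
import random


def fp8_grid(exp_bits, man_bits, ieee_specials):
    """Sorted non-negative values of a hypothetical ExMy format (assumed encoding)."""
    bias = 2 ** (exp_bits - 1) - 1
    top = 2 ** exp_bits - 1
    steps = 2 ** man_bits
    values = set()
    for e in range(top + 1):
        for m in range(steps):
            if e == top and (ieee_specials or m == steps - 1):
                continue  # skip encodings assumed to be inf/NaN
            if e == 0:  # zero and subnormals
                values.add((m / steps) * 2.0 ** (1 - bias))
            else:       # normal numbers
                values.add((1 + m / steps) * 2.0 ** (e - bias))
    return sorted(values)


def round_to_grid(x, grid):
    """Round |x| to the nearest grid value (saturating), then restore the sign."""
    a = abs(x)
    i = bisect.bisect_left(grid, a)
    if i == 0:
        q = grid[0]
    elif i == len(grid):
        q = grid[-1]
    else:
        lo, hi = grid[i - 1], grid[i]
        q = lo if a - lo <= hi - a else hi
    return math.copysign(q, x)


def quantize_fp8(xs, grid):
    # Max scaling: map the largest magnitude onto the format's largest value.
    scale = grid[-1] / max(abs(x) for x in xs)
    return [round_to_grid(x * scale, grid) / scale for x in xs]


def quantize_int8(xs):
    # Symmetric per-tensor INT8 with max scaling (assumed range [-127, 127]).
    scale = 127 / max(abs(x) for x in xs)
    return [max(-127, min(127, round(x * scale))) / scale for x in xs]


def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_tensor(n, seed=0, outlier_rate=0.01):
    # Variance 0.5 means standard deviation sqrt(0.5).
    rng = random.Random(seed)
    sigma = math.sqrt(0.5)
    return [rng.uniform(-6.0, 6.0) if rng.random() < outlier_rate
            else rng.gauss(0.0, sigma) for _ in range(n)]


if __name__ == "__main__":
    x = make_tensor(10_000)
    specs = {"E5M2": (5, 2, True), "E4M3": (4, 3, False), "E3M4": (3, 4, False)}
    for name, spec in specs.items():
        print(name, mse(x, quantize_fp8(x, fp8_grid(*spec))))
    print("INT8", mse(x, quantize_int8(x)))
```

The resulting numbers depend on the seed, the rounding rule, and the scaling choice, so they should be read as qualitative and not as a reproduction of the figure.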