FP8 versus INT8 for efficient deep learning inference

Mart van Baalen, Andrey Kuzmin, Suparna S Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, Sanghyuk Lee, Markus Nagel, Joseph Soriaga, Tijmen Blankevoort

TL;DR

The paper evaluates FP8 versus INT8 for on-device neural network inference, combining a theoretical analysis of the number formats with extensive PTQ and QAT experiments across diverse models. It shows that FP8 formats (especially FP8-E4/E5) incur substantial hardware costs and, for most well-behaved networks, do not yield better accuracy than INT8. Transformer-specific outlier cases can occasionally favor FP8, but the hardware inefficiency of FP8 and the availability of quantization techniques that handle outliers make FP8 unnecessary. The authors argue that INT8 (and occasionally INT4 with W8A16) provides superior practical efficiency, with FP8 primarily suited for training rather than inference. They also show that FP8-to-INT8 conversion can preserve or improve accuracy in many cases, reinforcing INT8 as the robust path for edge deployment and suggesting that FP8-based inference hardware is unlikely to offer net benefits. Overall, the work agrees with prior findings but offers a broader, hardware-aware comparison that cautions against adopting FP8 for inference.

Abstract

Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with mixed precision, the step to having some parts of a network run in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive training procedures in deep learning. A natural question arises regarding what this development means for efficient inference on edge devices. In the efficient inference device world, workloads are frequently executed in INT8, sometimes going even as low as INT4 when efficiency calls for it. In this whitepaper, we compare the performance of both the FP8 and INT formats for efficient on-device inference. We theoretically show the difference between the INT and FP formats for neural networks and present a plethora of post-training quantization and quantization-aware training results to show how this theory translates to practice. We also provide a hardware analysis showing that the FP formats are somewhere between 50-180% less efficient in terms of compute in dedicated hardware than the INT format. Based on our research and a read of the research field, we conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8 in favor of INT8 for efficient inference. We show that our results are mostly consistent with previous findings but that important comparisons between the formats have thus far been lacking. Finally, we discuss what happens when FP8-trained networks are converted to INT8 and conclude with a brief discussion on the most efficient way for on-device deployment and an extensive suite of INT8 results for many models.

Paper Structure (42 sections, 3 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: A schematic overview of a deep learning accelerator. Weights and activations are stored in memory and moved to the MatMul calculation unit. The bit-width matters for both the latency and the energy consumed when transferring the data. The calculation unit performs a matrix multiplication; here, both the bit-width and the format matter for latency and energy consumption. The accumulator stores the intermediate results, for which the format and bit-width can be chosen. Finally, the output format can also be chosen; its bit-width dictates how many bits are transferred and stored back in memory.
  • Figure 2: A schematic overview of the components that go into a multiply-accumulate unit in silicon. Left: a fixed-point (Kulisch) accumulator. Right: a floating-point accumulator. The dark blue parts represent the logic necessary for the multiplication itself. The grey area is needed for accumulation, and the light blue/green parts align and add a product to the accumulator.
  • Figure 3: A count of the number of 2-input gates necessary in hardware to implement each format and accumulator combination. From left to right: INT8, followed by formats with an increasing number of exponent bits, up to FP8-E5. In each group of three, the first bar is for a 15+12=27-bit fixed-point accumulator, the second bar is for FP16 accumulation, and the third bar is for an FP32 accumulator. We can see that FP8-E4->16-bit requires 53% more gates, and FP8-E4->32-bit requires 183% more gates for an implementation in hardware.
  • Figure 4: An example of an unsigned 4-bit floating-point format. When the scaling factor $b$ is flexible, the integer and floating-point formats can cover the same range of representable values. The only difference is in how the representable values are distributed: the floating-point format can either capture values closer to $0$ more accurately or represent outliers better. This comes at the cost of accuracy in the other region.
  • Figure 5: Here we plot, for several distributions, the 'bits of accuracy': inverted and normalized RMSE. More bits of accuracy is better. For the uniform distribution, INT8 is the best. For normal distributions, FP8-E2 is optimal, with INT8 as a close second. Many distributions in neural networks are normally distributed, meaning that results on this distribution are a very relevant indicator of performance. Only when outliers enter the picture do formats with more exponent bits start giving a better result. The optimal quantizer, based on the Lloyd-Max quantizer, is the best you could get for these distributions.
  • ...and 5 more figures
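
The comparison described in the captions of Figures 4 and 5 can be roughly sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the helper names (`int_grid`, `fp_grid`, `quantize_rmse`, `compare`) are hypothetical, the floating-point grids reserve no codes for inf/NaN, each grid is rescaled so that its largest value matches the largest sample magnitude (one simple way to realize a flexible scaling factor), and plain RMSE is reported rather than the normalized 'bits of accuracy' metric.

```python
import bisect
import math
import random


def int_grid(bits=8):
    """Symmetric signed integer grid, e.g. -127..127 for 8 bits."""
    qmax = 2 ** (bits - 1) - 1
    return [float(v) for v in range(-qmax, qmax + 1)]


def fp_grid(exp_bits, man_bits):
    """All values of a sign/exponent/mantissa format with subnormals.

    Assumptions: no codes are reserved for inf/NaN, and the exponent bias
    is left out because the grid is rescaled to the data afterwards.
    """
    pos = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                pos.add(frac * 2.0)              # subnormal: 0.m * 2^1
            else:
                pos.add((1.0 + frac) * 2.0 ** e)  # normal: 1.m * 2^e
    return sorted({-v for v in pos} | pos)


def quantize_rmse(samples, grid):
    """Scale the grid to the sample range, round to nearest, return RMSE."""
    scale = max(abs(x) for x in samples) / grid[-1]
    scaled = [g * scale for g in grid]
    err = 0.0
    for x in samples:
        i = bisect.bisect_left(scaled, x)
        # Nearest grid point is one of the (at most two) neighbours of x.
        candidates = scaled[max(i - 1, 0): i + 1]
        q = min(candidates, key=lambda g: abs(g - x))
        err += (q - x) ** 2
    return math.sqrt(err / len(samples))


def compare(samples):
    """RMSE per format: INT8 and FP8 with 2 to 5 exponent bits."""
    formats = {"INT8": int_grid(8)}
    for e in (2, 3, 4, 5):
        formats[f"FP8-E{e}"] = fp_grid(e, 7 - e)  # 1 sign bit + e + (7 - e)
    return {name: quantize_rmse(samples, grid) for name, grid in formats.items()}


rng = random.Random(0)
uniform = [rng.uniform(-1, 1) for _ in range(20000)]
normal = [rng.gauss(0, 1) for _ in range(20000)]
for label, data in (("uniform", uniform), ("normal", normal)):
    print(label, {k: round(v, 5) for k, v in compare(data).items()})
```

Lower RMSE corresponds to more bits of accuracy. Because this sketch sets the range from the largest sample rather than searching for the best scaling factor, its numbers for the normal distribution are only indicative and need not match Figure 5.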