Faster Inference of LLMs using FP8 on the Intel Gaudi
Joonhyung Lee, Shmulik Markovich-Golan, Daniel Ohayon, Yair Hanani, Gunho Park, Byeongwook Kim, Asaf Karnieli, Uri Livne, Haihao Shen, Tai Huang, Se Jung Kwon, Dongsoo Lee
TL;DR
This work gives a detailed account of FP8 quantization for end-to-end LLM inference on Intel Gaudi accelerators, covering the scaled FP8 GEMM mechanism, calibration workflows, and a range of activation and weight scaling strategies. Throughput results show high model FLOPs utilization (MFU), often above 90%, with accuracy degradation typically under 1% across tasks and peak FP8 dense GEMM throughput of up to 865 TFLOPS. The study evaluates multiple model families (e.g., Llama2, Llama3, Mistral) and analyzes how model scale, task type, and quantization method interact to shape accuracy and throughput. In practice, it offers a set of quantization recipes and calibration procedures for efficient FP8 inference on Gaudi 2/3, intended as a reference for researchers and practitioners who want to accelerate LLMs with low-precision arithmetic.
Abstract
Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resources. Despite the incorporation of FP8 in commercially available neural network accelerators, a comprehensive exposition of its underlying mechanisms, along with rigorous performance and accuracy evaluations, is still lacking. In this work, we contribute in three significant ways. First, we analyze the implementation details and quantization options associated with FP8 for inference on the Intel Gaudi AI accelerator. Second, we empirically quantify the throughput improvements afforded by the use of FP8 at both the operator level and in end-to-end scenarios. Third, we assess the accuracy impact of various FP8 quantization methods. Our experimental results indicate that the Intel Gaudi 2 accelerator consistently achieves high computational unit utilization, frequently exceeding 90% MFU, while incurring an accuracy degradation of less than 1%.
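To make the MFU figure quoted above concrete, the following minimal sketch shows how such a utilization number can be computed for a single dense GEMM. It is an illustration, not the authors' measurement code: the function names, the matrix shape, and the 1.4 ms timing are hypothetical, and the 865 TFLOPS value from the TL;DR is treated here as the peak throughput, which is an assumption.

```python
def gemm_flops(m: int, k: int, n: int) -> int:
    """FLOPs for a dense (m x k) @ (k x n) GEMM.

    Each of the m * k * n multiply-accumulates counts as two
    floating-point operations (one multiply, one add).
    """
    return 2 * m * k * n


def mfu(flops: float, elapsed_s: float, peak_tflops: float) -> float:
    """Achieved throughput as a fraction of the assumed peak throughput."""
    achieved_tflops = flops / elapsed_s / 1e12
    return achieved_tflops / peak_tflops


# Hypothetical measurement: an 8192 x 8192 x 8192 GEMM timed at 1.4 ms.
# The timing is invented for illustration; 865 TFLOPS is assumed to be the peak.
flops = gemm_flops(8192, 8192, 8192)
utilization = mfu(flops, elapsed_s=1.4e-3, peak_tflops=865.0)
print(f"MFU: {utilization:.1%}")
```

Under these assumed numbers the GEMM lands just above 90% utilization, the range the abstract reports as typical. For end-to-end inference the same ratio applies, with the FLOP count taken over the whole model rather than a single operator.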
