Faster Inference of LLMs using FP8 on the Intel Gaudi

Joonhyung Lee, Shmulik Markovich-Golan, Daniel Ohayon, Yair Hanani, Gunho Park, Byeongwook Kim, Asaf Karnieli, Uri Livne, Haihao Shen, Tai Huang, Se Jung Kwon, Dongsoo Lee

TL;DR

This work gives a detailed account of FP8 quantization for end-to-end LLM inference on Intel Gaudi accelerators, covering the scaled FP8 GEMM mechanism, calibration workflows, and a range of activation and weight scaling strategies. It reports throughput results with high MFU (often >90%) and accuracy degradation typically under 1% across tasks, with peak FP8 dense GEMM throughput reaching up to 865 TFLOPS. The study evaluates multiple model families (e.g., Llama2, Llama3, Mistral) and analyzes how model scale, task type, and quantization method interact to shape performance. In practice, it offers quantization recipes and calibration procedures for efficient FP8 inference on Gaudi 2/3, intended as a reference for researchers and practitioners accelerating LLMs with low-precision arithmetic.

Abstract

Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resources. Despite the incorporation of FP8 in commercially available neural network accelerators, a comprehensive exposition of its underlying mechanisms, along with rigorous performance and accuracy evaluations, is still lacking. In this work, we contribute in three significant ways. First, we analyze the implementation details and quantization options associated with FP8 for inference on the Intel Gaudi AI accelerator. Second, we empirically quantify the throughput improvements afforded by the use of FP8 at both the operator level and in end-to-end scenarios. Third, we assess the accuracy impact of various FP8 quantization methods. Our experimental results indicate that the Intel Gaudi 2 accelerator consistently achieves high computational unit utilization, frequently exceeding 90% MFU, while incurring an accuracy degradation of less than 1%.

Paper Structure

This paper contains 29 sections, 30 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Activation quantization block-diagram. The activation undergoes either per-sample or per-tensor scaling. Per-sample scaling is impractical for static scaling of activations because per-token information is unknown during calibration. Per-sample scaling would also require a fixed number of samples.
  • Figure 2: Weight quantization block-diagram. The high-precision weight undergoes either per-output-channel or per-tensor scaling. Weight quantization is static for inference.
  • Figure 3: Descaling of the scaled FP8 GEMM operation. Scaling factors are multiplied with one another, then applied to the GEMM results.
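
The three captions above can be read together as one pipeline: scale and quantize the activation, scale and quantize the weight, run the GEMM, then descale the result by the product of the scaling factors. The NumPy sketch below illustrates that flow under stated assumptions and is not the Gaudi implementation. The helper names `fake_fp8` and `scaled_fp8_gemm` are hypothetical, the FP8 format is only emulated (clipping plus rounding to 3 mantissa bits, with no subnormals or exponent limits), `FP8_MAX = 240.0` is an assumed representable maximum that depends on the FP8 variant, and the absolute-maximum rule for choosing scales is one common choice rather than necessarily the one used in the paper.

```python
import numpy as np

FP8_MAX = 240.0  # assumption: representable maximum; depends on the FP8 variant


def fake_fp8(x, fp8_max=FP8_MAX, mantissa_bits=3):
    """Crude FP8 emulation: clip to the range, then round the significand.

    Ignores subnormals and exponent limits, so it only models rounding
    and saturation, not the full format.
    """
    x = np.clip(x, -fp8_max, fp8_max)
    m, e = np.frexp(x)                    # x = m * 2**e with 0.5 <= |m| < 1
    step = 2.0 ** (mantissa_bits + 1)     # implicit bit + explicit mantissa bits
    return np.ldexp(np.round(m * step) / step, e)


def scaled_fp8_gemm(x, w, per_channel=True):
    """x: (tokens, in_features), w: (out_features, in_features)."""
    # Per-tensor activation scale (Figure 1): one scalar for the whole tensor.
    s_x = np.abs(x).max() / FP8_MAX
    # Weight scale (Figure 2): one scalar per output channel, or per tensor.
    if per_channel:
        s_w = np.abs(w).max(axis=1, keepdims=True) / FP8_MAX   # (out, 1)
    else:
        s_w = np.abs(w).max() / FP8_MAX
    x_q = fake_fp8(x / s_x)
    w_q = fake_fp8(w / s_w)
    acc = x_q @ w_q.T                     # accumulate in high precision
    # Descaling (Figure 3): multiply the scales together, then apply them
    # to the GEMM result. Shape (1, out) broadcasts over the token axis.
    descale = s_x * np.reshape(s_w, (1, -1))
    return acc * descale


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w = rng.standard_normal((8, 16))
ref = x @ w.T
for per_channel in (True, False):
    out = scaled_fp8_gemm(x, w, per_channel=per_channel)
    rel_err = np.linalg.norm(out - ref) / np.linalg.norm(ref)
    print(f"per_channel={per_channel}: relative error {rel_err:.4f}")
```

Because the activation scale is a single scalar and the weight scale varies only along the output-channel axis, the combined descale factor can be applied once to the GEMM output rather than inside the inner product, which is what makes these two granularities convenient for a fused scaled GEMM.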