FP4 All the Way: Fully Quantized Training of LLMs

Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry

TL;DR

This paper demonstrates fully quantized training of large language models in FP4, covering weights, activations, and gradients, using the NVFP4 format on a 7B-parameter model. It systematically characterizes FP4 design choices (block size, scale format, and rounding scheme) and derives a gradient-to-noise threshold, supported both theoretically and empirically, for deciding when to switch to higher precision. After a brief quantization-aware fine-tuning (QAF) phase, the FP4-trained model matches BF16 performance on downstream tasks, which points to substantial potential compute and memory savings. Gaudi devices do not currently support FP4 execution in hardware, but the study offers a practical roadmap and a reference implementation for end-to-end FP4 training at scale.

Abstract

We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied in https://github.com/Anonymous1252022/fp4-all-the-way .
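
To make the format described in the abstract concrete, the following is a minimal sketch of NVFP4-style "fake quantization" in plain Python: values are split into blocks of 16, each block gets one shared scale rounded to E4M3-like precision, and each element is snapped to the E2M1 grid with either round-to-nearest (RtN) or stochastic rounding (SR). This is not the authors' reference implementation. The function names are hypothetical, the scale rounding ignores E4M3 subnormals, ties in RtN go to the lower grid point, and any additional per-tensor scaling is omitted.

```python
import math
import random

# Magnitudes representable in E2M1 (sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16      # block size described in the abstract
E4M3_MAX = 448.0     # largest finite FP8 E4M3 value


def round_scale_e4m3(scale):
    """Round a positive scale to 3 mantissa bits (simplified: no subnormals)."""
    if scale <= 0.0:
        return 0.0
    mantissa, exponent = math.frexp(scale)   # scale = mantissa * 2**exponent
    rounded = math.ldexp(round(mantissa * 16) / 16, exponent)
    return min(rounded, E4M3_MAX)


def quantize_magnitude(a, stochastic, rng):
    """Snap a non-negative value to the E2M1 grid with RtN or SR."""
    a = min(a, E2M1_GRID[-1])                 # clip to the largest grid value
    lo = max(g for g in E2M1_GRID if g <= a)
    hi = min(g for g in E2M1_GRID if g >= a)
    if lo == hi:
        return lo
    if stochastic:
        # Round up with probability proportional to the distance from lo,
        # which makes the rounding unbiased in expectation.
        p_up = (a - lo) / (hi - lo)
        return hi if rng.random() < p_up else lo
    # Round-to-nearest; ties go to lo in this sketch (an assumption).
    return lo if (a - lo) <= (hi - a) else hi


def quantize_block(block, stochastic=False, rng=random):
    """Return (E2M1 codes, shared scale) for one block."""
    amax = max(abs(v) for v in block)
    scale = round_scale_e4m3(amax / E2M1_GRID[-1])
    if scale == 0.0:
        return [0.0] * len(block), 0.0
    codes = [
        math.copysign(quantize_magnitude(abs(v) / scale, stochastic, rng), v)
        for v in block
    ]
    return codes, scale


def fake_quantize(values, stochastic=False, rng=random):
    """Quantize and immediately dequantize a flat list of floats."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        codes, scale = quantize_block(values[start:start + BLOCK_SIZE], stochastic, rng)
        out.extend(c * scale for c in codes)
    return out
```

Following the abstract, a training loop built on this sketch would call `fake_quantize(..., stochastic=False)` on forward-pass operands and `fake_quantize(..., stochastic=True)` on operands of the backward and update passes.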

Paper Structure

This paper contains 29 sections, 53 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Formats E4M3 (used in NVFP4) and E3M4 achieved the best results. Comparison of different scaling formats (E1M6, E2M5, E3M4, E4M3, E5M2, E6M1, E8M0) when training a 350M Llama model using FP4 format (E2M1) with block size 16. The formats E3M4 and E4M3 achieve the best results (recall E4M3 is used in NVFP4), whereas E1M6 results in complete divergence.
  • Figure 2: Block size 16 is the best option. We examine the impact of different block sizes (8, 16, 32, 64, 128) on training accuracy using scaling formats: (a) E8M0 (used in MXFP4) and (b) E4M3 (used in NVFP4). Smaller block sizes yield modest improvements in accuracy, with diminishing returns below 16 elements per block. Thus, a block size of 16 provides an optimal compromise between performance and computational overhead.
  • Figure 3: Comparison of different rounding schemes when training a 350M Llama model using NVFP4 format. In each graph, we apply SR in one of the six elements in one of the GEMMs while the rest use round-to-nearest (RtN). Notice that applying SR to neural gradients during both 'Update' and 'Backward' GEMMs and activations during the 'Update' GEMM leads to lower training loss, while applying SR to other components has the opposite effect, increasing the loss.
  • Figure 4: Validation of theoretical derivation in a simple quadratic loss. Training loss with noise levels $\sigma_q = k \cdot \sigma_{\mathrm{crit}}$ for $k = 2, 1, 0.5$ in a toy quadratic model. High noise blocks descent; low noise allows continued progress.
  • Figure 5: Validation of the theoretical prediction in a Llama 60M model. (Left): The difference between the loss curve of the baseline and that of a model whose precision is increased mid-training (1000th iteration, vertical dashed orange line). After the precision is increased, the loss difference vanishes. (Right): Gradient-to-noise ratio with the $\sqrt{3}$ threshold (black dashed line).
  • ...and 2 more figures
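
The $\sqrt{3}$ threshold mentioned in the abstract and plotted in Figure 5 can be illustrated with a short monitoring sketch. It assumes, purely for illustration, that the "quantization noise" is measured as the L2 norm of the difference between a quantized gradient and its full-precision counterpart. The paper's exact definition is not given in this section, and the function names below are hypothetical.

```python
import math

SQRT3 = math.sqrt(3.0)  # threshold quoted in the abstract


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def gradient_to_noise_ratio(grad, grad_quantized):
    """Ratio of the gradient norm to the norm of the quantization error.

    Assumption: noise is taken as (quantized gradient - full-precision gradient).
    """
    noise = [q - g for g, q in zip(grad, grad_quantized)]
    noise_norm = l2_norm(noise)
    if noise_norm == 0.0:
        return math.inf
    return l2_norm(grad) / noise_norm


def below_threshold(grad, grad_quantized, threshold=SQRT3):
    """True when the ratio drops under the threshold, i.e. when a switch
    to higher precision would be considered under this sketch."""
    return gradient_to_noise_ratio(grad, grad_quantized) < threshold
```

For example, `below_threshold([3.0, 4.0], [3.0, 4.5])` is false because the gradient norm is ten times the error norm, while `below_threshold([0.3, 0.4], [0.5, 0.0])` is true because the error is nearly as large as the gradient itself.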