Quartet: Native FP4 Training Can Be Optimal for Large Language Models

Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh

TL;DR

This work tackles end-to-end FP4 training of large language models using the hardware-native MXFP4 format on NVIDIA Blackwell GPUs. It introduces Quartet, a four-ingredient framework that couples scaling-law analysis with a forward pass designed to minimize quantization error and a backward pass designed to keep gradient estimates unbiased, implemented via optimized CUDA/CUTLASS kernels. Across Llama-style pretraining tasks, Quartet achieves better accuracy and speed than prior FP4/INT4 methods and can outperform FP8 baselines under realistic compute budgets. The work demonstrates that MXFP4 can be used effectively and efficiently for large-scale pretraining, offering substantial reductions in compute and energy costs while preserving model quality. Its practical impact is cheaper, faster LLM pretraining, along with open-source tools to reproduce and extend fully quantized training on next-generation hardware.
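To make the MXFP4 format and the notion of an unbiased backward pass concrete, the following is a minimal NumPy sketch of MXFP4-style "fake" quantization. It is not the Quartet kernel. It assumes the standard microscaling layout (blocks of 32 elements sharing one power-of-two scale, with E2M1 elements), and the function name `mxfp4_quantize`, the tie-breaking rule, and the choice to round the scale up so that no element is clipped are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # MXFP4 groups 32 elements under one shared power-of-two scale


def mxfp4_quantize(x, stochastic=False, rng=None):
    """Fake-quantize a 1-D array to an MXFP4-like grid; returns float64 values."""
    x = np.asarray(x, dtype=np.float64)
    assert x.size % BLOCK == 0, "length must be a multiple of the block size"
    blocks = x.reshape(-1, BLOCK)

    # Shared scale per block: a power of two, rounded UP so that the largest
    # magnitude maps to at most 6 (assumption: avoid clipping, which would
    # bias stochastic rounding).
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)
    scale = 2.0 ** np.ceil(np.log2(safe / 6.0))
    scaled = np.abs(blocks) / scale

    # Locate the two neighbouring grid points for every element.
    hi_idx = np.clip(np.searchsorted(E2M1_GRID, scaled, side="left"),
                     0, len(E2M1_GRID) - 1)
    lo_idx = np.maximum(hi_idx - 1, 0)
    lo, hi = E2M1_GRID[lo_idx], E2M1_GRID[hi_idx]

    # Fractional position between the neighbours (0 when they coincide).
    gap = hi - lo
    p_up = np.where(gap > 0, (scaled - lo) / np.where(gap > 0, gap, 1.0), 0.0)

    if stochastic:
        # Round up with probability p_up, so E[q] equals the input value.
        rng = np.random.default_rng() if rng is None else rng
        up = rng.random(scaled.shape) < p_up
    else:
        # Round to nearest (ties go up in this sketch).
        up = p_up >= 0.5

    q = np.where(up, hi, lo)
    return (np.sign(blocks) * q * scale).reshape(x.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(64)
    nearest = mxfp4_quantize(w)
    averaged = np.mean(
        [mxfp4_quantize(w, stochastic=True, rng=rng) for _ in range(1000)], axis=0
    )
    print("round-to-nearest max error:", np.abs(nearest - w).max())
    print("stochastic mean max error: ", np.abs(averaged - w).max())
```

Round-to-nearest gives the smallest error for a single quantization, while stochastic rounding trades a larger per-sample error for an estimate that is correct on average; that trade-off is the general distinction between minimizing error in the forward pass and keeping the backward pass unbiased.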

Abstract

Training large language models (LLMs) directly in low precision offers a way to address computational costs by improving both throughput and energy efficiency. To this end, NVIDIA's recent Blackwell architecture supports very low-precision operations using FP4 variants. Yet current algorithms for training LLMs in FP4 precision suffer significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an "optimal" technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.

Paper Structure

This paper contains 41 sections, 4 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Analysis of Quartet: (a) Fit of the paper's main scaling law for various FORWARD:BACKWARD precisions. (b) Regions where each FORWARD:BACKWARD precision is optimal under the BOPS speedup model. (c) Same as (b) but with RTX 5090 speedups. Interestingly, popular models such as larger Llama3 or Qwen2.5 models fall into the FP4:FP4 optimality region, implying that training similar models in FP4 might have been optimal.
  • Figure 2: The effect of backward pass quantization on LLM training gradient quality and its impact on performance: (a, left) and (b, middle) show cosine similarity and projection-magnitude misalignment with the unquantized reference, while (c, right) shows performance gaps relative to a non-quantized baseline for a set of model sizes and data-to-parameter ratios (D/N).
  • Figure 3: (a, left), (b, middle): Block-wise speedup of Quartet kernels across model sizes relative to FP8 and BF16. (c, right): Training dynamics for the 7B model trained with Quartet relative to FP8.
  • Figure 4: Correspondence between validation loss on C4 and various few-shot benchmarks for Llama models with 30-200M parameters.
  • Figure 5: Comparison of various scaling law fits and their errors.
  • ...and 2 more figures