Quartet: Native FP4 Training Can Be Optimal for Large Language Models

Roberto L. Castro; Andrei Panferov; Soroush Tabesh; Oliver Sieberling; Jiale Chen; Mahdi Nikdan; Saleh Ashkboos; Dan Alistarh

TL;DR

This work tackles end-to-end FP4 training of large language models using the hardware-native MXFP4 format on NVIDIA Blackwell GPUs. It introduces Quartet, a four-ingredient framework that couples scaling-law analysis with a forward pass designed to minimize quantization error and a backward pass that keeps gradient estimates unbiased, implemented in optimized CUDA/CUTLASS kernels. In Llama-style pretraining, Quartet achieves higher accuracy and speed than prior FP4/INT4 methods and can outperform FP8 baselines under realistic compute budgets. The results show that MXFP4 can be used effectively and efficiently for large-scale pretraining, substantially reducing compute and energy costs while preserving model quality. In practice, this enables cheaper, faster LLM pretraining, and the accompanying open-source tools make it possible to reproduce and extend fully quantized training on next-generation hardware.

Abstract

Training large language models (LLMs) directly in low precision offers a way to address computational costs by improving both throughput and energy efficiency. To this end, NVIDIA's recent Blackwell architecture supports very low-precision operations using FP4 variants. Yet current algorithms for training LLMs in FP4 precision suffer significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an "optimal" technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
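
For readers unfamiliar with the format, the following is a minimal NumPy sketch of MXFP4-style "fake" quantization. It is not the Quartet implementation or its CUDA kernels. It assumes the OCP Microscaling layout (blocks of 32 values sharing one power-of-two scale, each element stored as E2M1). The function name `mxfp4_quantize`, the scale-selection rule (the smallest power of two that avoids clipping), and the tie-breaking choice are illustrative assumptions. Stochastic rounding is included as one common way to obtain an unbiased quantizer of the kind the TL;DR alludes to for the backward pass; whether Quartet uses exactly this scheme is not stated in this section.

```python
import numpy as np

# E2M1 (FP4) representable magnitudes in the OCP Microscaling (MX) format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # MXFP4 shares one power-of-two scale across 32 elements


def mxfp4_quantize(x, stochastic=False, rng=None):
    """Quantize a 1-D array (length divisible by 32) to MXFP4 and dequantize.

    Hypothetical helper for illustration only. The scale exponent range of
    the real E8M0 scale format is not enforced here.
    """
    x = np.asarray(x, dtype=np.float64).reshape(-1, BLOCK)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Assumption: smallest power-of-two scale such that amax / scale <= 6.
    exp = np.ceil(np.log2(np.where(amax > 0, amax, 1.0) / FP4_GRID[-1]))
    scale = 2.0 ** exp
    mag = np.abs(x) / scale
    # Grid neighbours just below (low) and at-or-above (high) each magnitude.
    hi = np.clip(np.searchsorted(FP4_GRID, mag, side="left"), 1, len(FP4_GRID) - 1)
    low, high = FP4_GRID[hi - 1], FP4_GRID[hi]
    p_up = (mag - low) / (high - low)  # fractional position between neighbours
    if stochastic:
        rng = np.random.default_rng() if rng is None else rng
        up = rng.random(mag.shape) < p_up  # round up with probability p_up
    else:
        up = p_up >= 0.5  # round to nearest (ties round up in this sketch)
    q = np.where(up, high, low)
    return (np.sign(x) * q * scale).reshape(-1)
```

With round-to-nearest, each value lands on the closest grid point, which keeps the per-element error small. With `stochastic=True`, individual values are noisier, but the expected value of each quantized element equals its input, so averaged gradients are not systematically shifted.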
