Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

TL;DR

This work critically assesses the practicality of microscaling FP4 formats for LLM inference and finds that MXFP4 and NVFP4 are not automatically superior to existing INT4 approaches. It develops MR-GPTQ, an FP4-focused variant of GPTQ that combines block-wise Hadamard rotations, MSE-optimized grids, and static activation reordering with QuTLASS kernels that realize low-overhead rotations on Blackwell GPUs. The empirical evaluation shows that MR-GPTQ substantially closes the accuracy gap, bringing MXFP4 near NVFP4 in many cases, and delivers speedups over FP16 of up to 3.6x layer-wise and 2.2x end-to-end on B200, and 6x layer-wise and 4x end-to-end on RTX5090. The authors also demonstrate that MXFP4 can be enhanced via scale-fitting and that the overall FP4 landscape benefits from format-specialized algorithms. Overall, the paper argues that FP4 enables a new accuracy-performance frontier when FP4-specific methods are employed, rather than representing a universal upgrade over INT4.

Abstract

The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4, due to two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties, by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by fusing rotations into the weights and computing the activation rotations quickly online. This leads to speedups vs. FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4, to the point where it nears the accuracy of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
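
To make issue (2) concrete, here is a minimal pure-Python sketch of group-wise FP4 quantization. It is an illustration under stated assumptions, not the paper's implementation: the E2M1 element grid and the group sizes mentioned in the comments (32 for MXFP4, 16 for NVFP4) come from the public format descriptions rather than from this page, the power-of-two scale is rounded up to avoid clipping (a simplification), and NVFP4's FP8 (E4M3) scale is approximated by an unrestricted real-valued scale. All function names are hypothetical.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_grid(v):
    """Round v to the nearest signed E2M1 value, saturating at +/-6."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(v) - g))
    return math.copysign(mag, v)

def quantize_group(group, pow2_scale):
    """Quantize one group of values that share a single scale.

    pow2_scale=True  : MXFP4-style scale restricted to powers of two (E8M0).
    pow2_scale=False : real-valued scale, a stand-in (assumption) for the
                       finer-grained FP8 (E4M3) scale used by NVFP4.
    """
    absmax = max(abs(v) for v in group)
    if absmax == 0.0:
        return list(group)
    scale = absmax / 6.0  # map the largest magnitude onto the top grid value
    if pow2_scale:
        # Simplification: round the scale UP to a power of two so nothing clips.
        scale = 2.0 ** math.ceil(math.log2(scale))
    return [scale * round_to_grid(v / scale) for v in group]

def quantize(x, group_size, pow2_scale):
    """Apply quantize_group to consecutive groups (e.g. 32 for MXFP4, 16 for NVFP4)."""
    out = []
    for i in range(0, len(x), group_size):
        out.extend(quantize_group(x[i:i + group_size], pow2_scale))
    return out

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Tiny worked example: the same two values under each scale type.
print(quantize_group([4.5, 1.0], pow2_scale=True))   # -> [4.0, 1.0]
print(quantize_group([4.5, 1.0], pow2_scale=False))  # -> [4.5, 1.125]
```

In this toy group the power-of-two scale is forced from 0.75 up to 1.0, so the largest element can no longer be represented exactly, whereas the finer scale preserves it. This mirrors, in miniature, the extra error the abstract attributes to power-of-two scale quantization.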

Paper Structure

This paper contains 49 sections, 4 theorems, 18 equations, 14 figures, 14 tables.

Key Result

Lemma 1

Assume a vector $x\in\mathbb{R}^G$ with coordinates i.i.d. $\mathcal{N}(0,1)$, to which we apply a Hadamard rotation, perform MFP quantization in the $y$-domain to produce $\widehat{y}$, and reconstruct $\widehat{x}=\tfrac{1}{\sqrt{G}}H^\top\widehat{y}$. Define the quantization error vectors $\varep
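
The setup described in the lemma (rotate, quantize in the $y$-domain, rotate back) can be sketched in a few lines of pure Python. This is a hedged illustration only: the names `hadamard`, `rotate`, `quantize`, and `round_trip` are hypothetical, the group size $G=16$ is chosen arbitrarily, and the quantizer is a simplified stand-in (full-precision absmax scale plus rounding to the FP4 E2M1 grid) rather than the paper's exact MFP definition. The one property it checks is a standard fact: because $\tfrac{1}{\sqrt{G}}H$ is orthonormal, the squared error in the $x$-domain equals the squared error in the $y$-domain.

```python
import math
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 E2M1 magnitudes

def hadamard(G):
    """Sylvester construction of the G x G Hadamard matrix (G a power of two)."""
    H = [[1]]
    while len(H) < G:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def rotate(M, v):
    """Return (1/sqrt(G)) * M @ v for a G x G matrix M."""
    s = 1.0 / math.sqrt(len(v))
    return [s * sum(m * x for m, x in zip(row, v)) for row in M]

def quantize(y):
    """Stand-in quantizer (assumption): absmax scale + nearest E2M1 value."""
    absmax = max(abs(v) for v in y)
    if absmax == 0.0:
        return list(y)
    scale = absmax / 6.0
    return [
        scale * math.copysign(
            min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g)), v)
        for v in y
    ]

def sq_err(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def round_trip(G, seed=0):
    """Draw x ~ N(0, I_G), rotate, quantize, and rotate back."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(G)]
    H = hadamard(G)
    H_T = [list(col) for col in zip(*H)]
    y = rotate(H, x)            # y = (1/sqrt(G)) H x
    y_hat = quantize(y)         # quantization happens in the y-domain
    x_hat = rotate(H_T, y_hat)  # x_hat = (1/sqrt(G)) H^T y_hat
    return x, x_hat, y, y_hat

x, x_hat, y, y_hat = round_trip(16)
# The two squared errors agree up to floating-point rounding.
print(sq_err(x, x_hat), sq_err(y, y_hat))
```

The rotation therefore does not change the total error; what it changes is how that error is distributed across coordinates, which is the quantity the lemma goes on to analyze.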

Figures (14)

  • Figure 1: Schematic illustration of the MXFP4 (left) and NVFP4 (right) microscaling formats.
  • Figure 2: Distribution fits for aggregate weights and activations of Llama-3.1-8B-Instruct, with and without rotations. The Normal distribution is clearly a good fit for rotated weights and activations, while the Laplace distribution provides a good fit for the native distributions. Although native weights appear Normal, they have much heavier tails, as evidenced by the Kurtosis value.
  • Figure 3: The effect of Hadamard Transform (HT) on MXFP4 (E8M0) and NVFP4 (E4M3) quantization on Laplace distribution samples and Llama-3.1-8B-Instruct weights and activations for various group sizes.
  • Figure 4: Ranges of FP8 scale format and observed weight and activation magnitudes.
  • Figure 5: Recoveries with real quantization.
  • ...and 9 more figures

Theorems & Definitions (12)

  • Definition 1: Modeling
  • Definition 2: Scales
  • Definition 3: Quantization Metrics
  • Remark 1: Quantization Dead-zone
  • Lemma 1: Top-Element MSE
  • Remark 2: Outlier preservation
  • Lemma 2: Rates
  • Definition 4: Relative Metrics
  • Lemma 3: Outliers MAPE
  • proof
  • ...and 2 more