Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction

Jatin Chhugani; Geonhwa Jeong; Bor-Yiing Su; Yunjie Pan; Hanmei Yang; Aayush Ankit; Jiecao Yu; Summer Deng; Yunqing Chen; Nadathur Satish; Changkyu Kim

TL;DR

This work introduces two software-only techniques, Overflow-Aware Scaling (OAS) and Macro Block Scaling (MBS), that improve MXFP4 quantization fidelity without requiring hardware changes, and re-establish MXFP4 as a practical alternative to NVFP4.

Abstract

Large Language Models (LLMs) have intensified the need for low-precision formats that enable efficient, large-scale inference. The Open Compute Project (OCP) Microscaling (MX) standard is attractive due to its favorable hardware efficiency, but its 4-bit variant (MXFP4) lags behind NVIDIA's NVFP4 in accuracy, limiting adoption. We introduce two software-only techniques, Overflow-Aware Scaling (OAS) and Macro Block Scaling (MBS), that improve MXFP4 quantization fidelity without requiring hardware changes. OAS reduces overall errors by increasing effective dynamic range under power-of-two block scaling, while MBS allocates higher-precision scaling at a coarser granularity to better preserve outliers. Across multiple LLMs and standard downstream benchmarks, OAS and MBS reduce the end-to-end accuracy gap between MXFP4 and NVFP4 from about 10% to below 1% on average, while incurring modest GEMM overhead (6.2% on average). These results re-establish MXFP4 as a practical alternative to NVFP4, enabling near-NVFP4 accuracy while retaining MX's hardware-efficiency advantages (e.g., 12% relative area savings in tensor cores).
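To make the setting concrete, here is a minimal sketch of baseline MXFP4 block quantization, the starting point that OAS and MBS aim to improve. It is not the authors' implementation of either technique. It assumes the usual MX conventions: 32-element blocks, FP4 E2M1 elements, and a power-of-two (E8M0-style) shared scale with exponent floor(log2(amax)) minus the E2M1 maximum exponent. The names `round_to_e2m1` and `quantize_block_mxfp4` are hypothetical, and ties are rounded toward the smaller magnitude for simplicity.

```python
import math

# Magnitudes representable by the FP4 E2M1 element format.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2    # exponent of the largest E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32  # elements sharing one scale in MX formats


def round_to_e2m1(v):
    """Round v to the nearest E2M1 value, saturating at +/-6.0.

    Simplification: ties go to the smaller magnitude, not to even.
    """
    mag = min(E2M1_GRID, key=lambda g: abs(abs(v) - g))
    return math.copysign(mag, v)


def quantize_block_mxfp4(block):
    """Return (shared_exponent, dequantized_values) for one block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp  # power-of-two scale, no mantissa bits
    return shared_exp, [round_to_e2m1(v / scale) * scale for v in block]


if __name__ == "__main__":
    block = [7.0] + [0.5] * (BLOCK_SIZE - 1)
    shared_exp, deq = quantize_block_mxfp4(block)
    print(shared_exp, deq[0], deq[1])
```

With this floor-based scale, a block whose largest magnitude is 7.0 gets a shared exponent of 0, so the scaled value 7.0 lies beyond the largest E2M1 magnitude and saturates to 6.0. Clipping of this kind is one source of error in power-of-two block scaling.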

Paper Structure (31 sections, 7 equations, 8 figures, 6 tables)

Introduction
Background
Transformer
Quantization with FP4
HW for MXFP4 GEMM and NVFP4 GEMM
Understanding NVFP4 vs. MXFP4
Analysis Methodology
Implications of Fine-Grained Block Quantization: (32 $\to$ 16)
Impact of Fine-Grained Scaling Factor Format: E8M0 $\to$ E4M3
Impact on HW Cost of Block Size and Scaling Factor Format
Proposed Direction
Enhancing MX Format
Quantization Block Granularity
Overflow-Aware Scaling (OAS)
Macro Block Scaling (MBS)
...and 16 more sections

Figures (8)

Figure 1: Modern LLM Model Architecture.
Figure 2: Comparison of different FP4 formats for quantization.
Figure 3: Hardware architecture of Tensor Core.
Figure 4: Overview of the DPU architecture.
Figure 5: Matrix multiplication $AB^T$ with MBS.
...and 3 more figures
