Block Rotation is All You Need for MXFP4 Quantization

Yuantian Shao, Peisong Wang, Yuanteng Chen, Chang Xu, Zhihui Wei, Jian Cheng

TL;DR

This work benchmarks post-training quantization methods for MXFP4-based $W4A4$ quantization, revealing a fundamental incompatibility between global rotation and MXFP4's PoT block scaling. It analyzes why rotation fails under MXFP4 (large values are reconstructed poorly within a PoT-scaled block, while rotation amplifies small-value blocks) and proposes Block Rotation Quantization (BRQ), a block-wise rotation strategy that preserves block scales and confines energy redistribution. BRQ, combined with GPTQ, yields substantial accuracy gains across multiple LLMs and reduces online rotation costs, outperforming existing rotation-based methods and often beating per-block FP16 baselines. The results offer practical guidance for deploying LLMs on MXFP4 hardware and lay groundwork for future PTQ improvements under low-precision formats.

Abstract

Large language models (LLMs) have achieved remarkable success, but their rapidly growing scale imposes prohibitive costs in memory, computation, and energy. Post-training quantization (PTQ) is a promising solution for efficient deployment, yet achieving accurate W4A4 quantization remains an open challenge. While most existing methods are designed for INT4 formats, the emergence of MXFP4, a new FP4 format with hardware support from multiple vendors (NVIDIA, AMD, Intel), raises questions about the applicability of current techniques. In this work, we establish a comprehensive benchmark of PTQ methods under the MXFP4 format. Through systematic evaluation, we find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which are used by almost all state-of-the-art methods, suffer from severe incompatibility with MXFP4. We further provide the first in-depth analysis of this conflict, tracing its root to a fundamental mismatch between MXFP4's PoT (power-of-two) block scaling and the redistribution of outlier energy via global rotation. Building on this insight, we propose a simple yet effective block rotation strategy that adapts rotation-based methods to MXFP4, leading to substantial accuracy improvements across diverse LLMs. Our findings not only offer clear guidance for practitioners but also set a foundation for advancing PTQ research under emerging low-precision formats.
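
The contrast between global and block rotation described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the block size of 32, the E2M1 value grid, the floor-based power-of-two scale rule, the plain (non-randomized) Hadamard matrix, and every function name here are assumptions introduced for the example.

```python
import numpy as np

# Assumed FP4 (E2M1) representable magnitudes and assumed MX block size.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32


def mxfp4_fake_quant(x, block=BLOCK):
    """Quantize then dequantize a 1-D vector, one power-of-two scale per block."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)  # avoid log2(0) for all-zero blocks
    # Assumed scale rule: 2^(floor(log2(amax)) - 2), where 2 is the largest
    # exponent of E2M1. Scaled magnitudes above 6 saturate to 6.
    scale = 2.0 ** (np.floor(np.log2(safe)) - 2)
    scaled = x / scale
    # Round each magnitude to the nearest grid point, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scale).reshape(-1)


def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def rotate_global(x):
    """Rotate the whole vector: an outlier's energy spreads over every block."""
    return hadamard(x.size) @ x


def rotate_blockwise(x, block=BLOCK):
    """Rotate each block independently: energy stays inside its own block."""
    H = hadamard(block)
    return (x.reshape(-1, block) @ H.T).reshape(-1)


def quant_mse(x, rotate):
    """Rotate, fake-quantize, rotate back, and report the mean squared error.

    The normalized Sylvester Hadamard matrix is symmetric and orthogonal,
    so applying the same rotation twice restores the original vector.
    """
    x_hat = rotate(mxfp4_fake_quant(rotate(x)))
    return float(np.mean((x - x_hat) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=4 * BLOCK)
    x[3] = 40.0  # a single synthetic outlier in the first block
    print("no rotation   :", quant_mse(x, lambda v: v))
    print("global rotate :", quant_mse(x, rotate_global))
    print("block rotate  :", quant_mse(x, rotate_blockwise))
```

The structural point is visible without running any benchmark: with a single outlier in one block, `rotate_blockwise` leaves the other blocks, and therefore their shared scales, untouched, whereas `rotate_global` changes the value range of every block. The printed errors depend on the synthetic input and should not be read as a reproduction of the paper's results.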

Paper Structure

This paper contains 26 sections, 1 equation, 10 figures, 14 tables.

Figures (10)

  • Figure 1: Overall performance of quantization methods under MXFP4. The x-axis shows perplexity, the y-axis shows average downstream accuracy, and methods nearer the top-left are closer to the FP16 baseline, indicating better performance.
  • Figure 2: Effect of rotation and its variants across different quantization formats. Rot applies a random Hadamard transform with RTN; Rot+GPTQ combines the transform with GPTQ; and Opt. Rot+GPTQ employs an optimized rotation matrix with GPTQ.
  • Figure 3: (a) illustrates the rounding error curve of the PoT format. (b) and (c) show the quantization error of MXFP4 relative to BFP4 for regular and outlier blocks, respectively. Bars represent the original activation values (right axis); lines indicate the relative quantization error (left axis).
  • Figure 4: Comparison of the distribution of Llama-3 8B activations after different transformations. More block-scale visualizations are provided in the appendix.
  • Figure 5: The effect of rotation transformation on activation distribution. The horizontal axis represents the segmentation threshold, and the vertical axis represents the percentage of data greater than the threshold.
  • ...and 5 more figures