Block Rotation is All You Need for MXFP4 Quantization

Yuantian Shao; Peisong Wang; Yuanteng Chen; Chang Xu; Zhihui Wei; Jian Cheng

TL;DR

This work benchmarks post-training quantization (PTQ) methods for MXFP4-based $W4A4$ quantization and reveals a fundamental incompatibility between global rotation and MXFP4's power-of-two (PoT) block scaling. It analyzes why rotation fails under MXFP4 (large values are reconstructed poorly within a PoT-scaled block, while global rotation amplifies blocks of small values) and proposes Block Rotation Quantization (BRQ), a block-wise rotation strategy that preserves block scales and confines energy redistribution to each block. Combined with GPTQ, BRQ yields substantial accuracy gains across multiple LLMs and reduces online rotation cost, outperforming existing rotation-based methods and often beating per-block FP16 baselines. The results offer practical guidance for deploying LLMs on MXFP4 hardware and lay the groundwork for future PTQ improvements under low-precision formats.

Abstract

Large language models (LLMs) have achieved remarkable success, but their rapidly growing scale imposes prohibitive costs in memory, computation, and energy. Post-training quantization (PTQ) is a promising solution for efficient deployment, yet achieving accurate W4A4 quantization remains an open challenge. Most existing methods are designed for INT4 formats, so the emergence of MXFP4, a new FP4 format supported by multiple hardware vendors (NVIDIA, AMD, Intel), raises questions about the applicability of current techniques. In this work, we establish a comprehensive benchmark of PTQ methods under the MXFP4 format. Through systematic evaluation, we find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which are used by almost all state-of-the-art methods, suffer from severe incompatibility with MXFP4. We further provide the first in-depth analysis of this conflict, tracing its root to a fundamental mismatch between MXFP4's power-of-two (PoT) block scaling and the redistribution of outlier energy via global rotation. Building on this insight, we propose a simple yet effective block rotation strategy that adapts rotation-based methods to MXFP4, leading to substantial accuracy improvements across diverse LLMs. Our findings not only offer clear guidance for practitioners but also lay a foundation for advancing PTQ research under emerging low-precision formats.
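The contrast the abstract draws between global and block-wise rotation can be illustrated with a small NumPy sketch. It is not the authors' implementation: the helper names (`hadamard`, `mxfp4_quantize`, `block_rotate`) are hypothetical, the use of a Hadamard matrix as the rotation is an assumption, and so is the choice of a rotation block equal to the MX block size of 32. The quantizer follows the MXFP4 convention of one shared power-of-two scale per 32-element block with E2M1 element values.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # MX block size; also used as the rotation block size (assumption)


def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction); n is a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def mxfp4_quantize(x):
    """Fake-quantize a 1-D vector whose length is a multiple of BLOCK."""
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    amax = np.where(amax == 0, 1.0, amax)  # avoid log2(0) for all-zero blocks
    # Shared power-of-two scale: 2^(floor(log2(amax)) - emax), with emax = 2 for E2M1.
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = np.clip(np.abs(blocks) / scale, 0.0, 6.0)  # saturate at the largest magnitude
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)  # nearest grid point
    return (np.sign(blocks) * FP4_GRID[idx] * scale).reshape(x.shape)


def block_rotate(x, h_block):
    """Rotate each BLOCK-sized chunk independently (block-diagonal rotation)."""
    return (x.reshape(-1, BLOCK) @ h_block).reshape(x.shape)


# Toy activation vector with a single outlier in the first block.
rng = np.random.default_rng(0)
x = rng.normal(size=128)
x[5] = 40.0

x_global = x @ hadamard(x.size)            # outlier energy spread over all blocks
x_block = block_rotate(x, hadamard(BLOCK))  # outlier energy stays in its own block

for name, v in [("global", x_global), ("block", x_block)]:
    per_block_max = np.abs(v).reshape(-1, BLOCK).max(axis=1)
    err = np.linalg.norm(mxfp4_quantize(v) - v)
    print(name, "per-block max:", np.round(per_block_max, 2), "quant error:", round(err, 3))
```

Because the block-wise rotation is block-diagonal and orthonormal, each block keeps its own energy, so blocks without outliers are not inflated by an outlier elsewhere; a global rotation mixes every coordinate into every block. The printed numbers depend on the toy input and are only meant for inspection, not as a reproduction of the paper's results.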
