Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

Musa Cim, Burak Topcu, Mahmut Taylan Kandemir

TL;DR

This study conducts a systematic analysis of two FP4 formats, MXFP4 and NVFP4, across three Qwen2.5 model scales, providing a diagnostic characterization of the inference behavior of FP4 across components, depths, and FP4 formats.

Abstract

Quantization reduces the resource demands of large language models (LLMs) by relieving memory pressure and bandwidth congestion and by increasing effective compute throughput, with a tolerable impact on accuracy. Four-bit floating point (FP4), the lowest-precision format that preserves essential numerical properties such as a sign and an exponent, has begun to be adopted in recent accelerator architectures, including NVIDIA Blackwell and AMD CDNA, to support LLM quantization and reduce deployment costs. Although aggressive quantization can yield efficiency gains, how sensitive the individual layers within a transformer are to quantization, and whether those sensitivities generalize across existing FP4 formats and model scales, remains underexplored. To characterize this sensitivity, this study systematically analyzes two FP4 formats, MXFP4 and NVFP4, across three Qwen2.5 model scales (0.5B, 7B, and 14B), using controlled component-wise and block-wise isolation methodologies. We observe that the MLP up- and down-projection layers are consistently the most sensitive, while the gate projection is moderately less sensitive and the attention projections are substantially less sensitive to FP4 quantization. We further find that sensitivity does not universally localize to the final blocks; early blocks can also be highly sensitive, particularly under MXFP4. Our results provide a diagnostic characterization of the inference behavior of FP4 across components, depths, and FP4 formats.
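To make the difference between the two formats concrete, here is a minimal sketch of simulated ("fake") FP4 quantization in plain Python. It is not the authors' code. It relies on background that the abstract does not spell out: both formats store elements as E2M1 values, MXFP4 shares one power-of-two scale per 32 elements, and NVFP4 shares one finer-grained scale per 16 elements. The helper names (`round_to_e2m1`, `mx_scale`, `nv_scale`, `fake_quantize`) are hypothetical, and two simplifications are assumed: the NVFP4 block scale is kept as a Python float rather than rounded to FP8 E4M3, and the per-tensor FP32 scale is omitted.

```python
import math

# Magnitudes representable by an E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0


def round_to_e2m1(x):
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Ties go to the smaller magnitude here; hardware rounding may differ.
    """
    mag = min(abs(x), E2M1_MAX)
    nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x)


def mx_scale(amax):
    """MXFP4-style scale: a power of two, 2^(floor(log2(amax)) - 2)."""
    if amax == 0.0:
        return 1.0
    return 2.0 ** (math.floor(math.log2(amax)) - 2)


def nv_scale(amax):
    """NVFP4-style scale mapping the block maximum onto 6.

    Assumption: kept as a float; real NVFP4 rounds this scale to FP8 E4M3.
    """
    return amax / E2M1_MAX if amax > 0.0 else 1.0


def fake_quantize(values, block_size, scale_fn):
    """Quantize and dequantize a flat list of floats, block by block."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = scale_fn(max(abs(v) for v in block))
        out.extend(round_to_e2m1(v / scale) * scale for v in block)
    return out


def mxfp4(values):
    return fake_quantize(values, 32, mx_scale)


def nvfp4(values):
    return fake_quantize(values, 16, nv_scale)


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Hypothetical weights, used only to compare the two round-trip errors.
    weights = [0.013 * i for i in range(-32, 32)]
    print("MXFP4 MSE:", mse(weights, mxfp4(weights)))
    print("NVFP4 MSE:", mse(weights, nvfp4(weights)))
```

Because the MXFP4 scale is restricted to powers of two, a block whose maximum falls between two powers of two either wastes part of the E2M1 range or saturates its largest values, whereas the NVFP4-style scale can place the block maximum exactly on 6. This is one plausible reason the two formats could show different sensitivity patterns, although the sketch itself does not establish that.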

Paper Structure (13 sections, 32 figures, 4 tables)


Figures (32)

  • Figure 1: Component sensitivity comparison across three model scales. Rows: 0.5B, 7B, 14B. Blue = MLP, Red = Attention. MLP projections (down and up) consistently form the most sensitive tier across all scales and formats.
  • Figure 2: Block-wise sensitivity analysis for down projection across three model scales. Y-axis uses symlog scale. Positive values indicate PPL improvement when keeping that block in FP16.
  • Figure 3: Block sensitivity heatmaps showing PPL improvement when each block's component is kept in FP16. Left: MXFP4 (scale 0--2.0). Right: NVFP4 (scale 0--0.28). Note the 7$\times$ scale difference and different spatial patterns.
  • Figure 4: Block sensitivity for up_proj. MXFP4 shows strong early-block sensitivity (blocks 0, 8, 23), while NVFP4 peaks at block 23.
  • Figure 5: Percentage change from baseline for up_proj. Keeping block 23 in FP16 changes PPL by $-3.3\%$ (MXFP4) and $-1.3\%$ (NVFP4), an improvement in both cases.
  • ...and 27 more figures
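
The captions above use two sign conventions for the same block-wise measurement: a positive PPL improvement (Figures 2 to 4) and a negative percentage change (Figure 5). The sketch below shows how the two could be computed and related. It assumes that the baseline is the model with the component quantized to FP4 in every block, which the captions imply but do not state. The function names and the perplexity values are hypothetical and are not taken from the paper.

```python
def ppl_improvement(ppl_all_fp4, ppl_block_fp16):
    """Absolute PPL improvement; positive when restoring the block to FP16 helps."""
    return ppl_all_fp4 - ppl_block_fp16


def percent_change(ppl_all_fp4, ppl_block_fp16):
    """Percentage change from baseline; negative when restoring the block helps."""
    return 100.0 * (ppl_block_fp16 - ppl_all_fp4) / ppl_all_fp4


def rank_blocks(ppl_all_fp4, ppl_by_block):
    """Return block indices ordered from most to least sensitive."""
    return sorted(
        ppl_by_block,
        key=lambda b: ppl_improvement(ppl_all_fp4, ppl_by_block[b]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical perplexities: baseline with all blocks in FP4, then one
    # run per block with only that block kept in FP16.
    baseline = 20.0
    runs = {0: 19.5, 8: 19.8, 23: 19.0}
    for block in rank_blocks(baseline, runs):
        print(block, ppl_improvement(baseline, runs[block]),
              percent_change(baseline, runs[block]))
```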