INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, Ping Luo

TL;DR

The paper investigates the trade-offs between low-bit integer and floating-point quantization across fine-grained block sizes for large language models, revealing a crossover where FP dominates coarse quantization but INT becomes advantageous as granularity tightens. It introduces a theoretical QSNR framework that yields explicit formulas for INT and FP under UE8M0 and NV/E4M3 scales, linking crest factor, block size, and scale overhead to performance. Empirical results across tensor-level analysis, direct-cast inference on 12 models, and 8-bit training show MXINT8 often surpasses MXFP8 in both accuracy and hardware efficiency, with Hadamard rotation enabling NVINT4 to outperform NVFP4 in some cases. Hardware cost modeling further demonstrates substantial energy and area savings for fine-grained INT formats, arguing for a shift in hardware design toward refined INT quantization to balance accuracy, power, and efficiency in future accelerators.

Abstract

Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
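
To make the quantities in the abstract concrete, the following is a minimal Python sketch of block-wise INT8 quantization with block size 32, a power-of-two scale per block, and a symmetric integer range. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the crest factor is taken as the peak-to-RMS ratio, QSNR is taken as signal power over quantization-noise power in dB, and the scale is rounded up to a power of two so that no value is clipped.

```python
import numpy as np

def crest_factor(x):
    """Peak-to-RMS ratio of a tensor (assumed definition of the crest factor)."""
    return np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2))

def qsnr_db(x, x_hat):
    """Quantization signal-to-noise ratio in dB (assumed definition; higher is better)."""
    noise_power = np.sum((x - x_hat) ** 2)
    return 10.0 * np.log10(np.sum(x ** 2) / noise_power)

def quantize_int8_blockwise(x, block_size=32):
    """Fake-quantize a 1-D array with one power-of-two scale per block.

    Hypothetical helper: uses the symmetric range [-127, 127], so the
    code -128 is never produced.
    """
    q_max = 127
    blocks = x.reshape(-1, block_size)
    abs_max = np.max(np.abs(blocks), axis=1, keepdims=True)
    abs_max = np.where(abs_max == 0, 1.0, abs_max)  # avoid log2(0) on all-zero blocks
    # Round the scale up to a power of two so that abs_max / scale <= q_max.
    scale = 2.0 ** np.ceil(np.log2(abs_max / q_max))
    q = np.clip(np.round(blocks / scale), -q_max, q_max)
    x_hat = (q * scale).reshape(x.shape)
    return x_hat, q.astype(np.int8)

# Usage: quantize Gaussian data and measure the resulting QSNR.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
x_hat, q = quantize_int8_blockwise(x)
print(f"crest factor: {crest_factor(x):.2f}, QSNR: {qsnr_db(x, x_hat):.1f} dB")
```

Because each block of 32 values gets its own scale, a single outlier only affects the step size of its own block, which is the intuition behind comparing formats at fine granularity.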

Paper Structure

This paper contains 28 sections, 43 equations, 5 figures, 15 tables, 1 algorithm.

Figures (5)

  • Figure 1: Compute flow of low-bit forward and backward propagation of a linear layer.
  • Figure 2: Impact of clipping range on the final INT8 training loss of a 145M model trained on 20B tokens. The scale factor is kept in BF16 to emphasize the harm of an asymmetric representation space during low-bit training.
  • Figure 3: Theoretical QSNR comparison between various integer (INT) and floating-point (FP) formats across a range of crest factors ($\kappa$), derived from Eq. (\ref{eq:int_qsnr}) and Eq. (\ref{eq:fp_qsnr}). The boxes mark the crest factor and QSNR at the crossover point of the INT and FP curves.
  • Figure 4: Practical QSNR across crest factors for 10,752 tensors sourced from ① to ⑥ in the compute flow in Figure \ref{fig:quant_flow}. (a) shows results for the vanilla tensors; (b) applies a random Hadamard rotation to each tensor before quantization. The box in the top right reports the average QSNR of INT and FP quantization and their respective win rates.
  • Figure 5: Loss-curve comparison among BF16, MXFP8, and MXINT8 training on Llama-1B with 100B tokens. Results are smoothed by an exponential moving average with a coefficient of 0.9.
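
Figure 4(b) applies a random Hadamard rotation before quantization. The sketch below shows one common way to build such a rotation in Python and how it lowers the crest factor of a block containing an outlier. It is illustrative only: the Sylvester construction, the random sign flip, the function names, and the peak-to-RMS definition of the crest factor are assumptions, not details taken from the paper.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def random_hadamard_rotate(x, rng):
    """Rotate the last axis of x by the orthogonal matrix H @ diag(signs) / sqrt(n)."""
    n = x.shape[-1]
    signs = rng.choice([-1.0, 1.0], size=n)  # random sign flip per coordinate
    return (x * signs) @ hadamard(n).T / np.sqrt(n)

def crest_factor(x):
    """Peak-to-RMS ratio (assumed definition of the crest factor)."""
    return np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2))

# Usage: a block whose energy sits in a single outlier.
rng = np.random.default_rng(0)
x = np.zeros(32)
x[5] = 8.0
y = random_hadamard_rotate(x, rng)
print(crest_factor(x), crest_factor(y))
```

The rotation is orthogonal, so it preserves the block's norm while spreading the outlier's energy evenly across all 32 coordinates. For this one-hot block the crest factor falls from $\sqrt{32}$ to 1, the regime in which the theoretical QSNR curves of Figure 3 are compared at low $\kappa$.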