Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Song Han

TL;DR

This work tackles divergence and accuracy loss when training or deploying large language models with NVFP4 quantization. It introduces Four Over Six (4/6), an adaptive per-block scaling strategy that quantizes each block twice, once with the block maximum scaled to the FP4 value M=4 and once with it scaled to M=6, and keeps the representation with the lower mean squared error; the method is implemented efficiently on NVIDIA Blackwell GPUs. Empirical results show that 4/6 stabilizes pre-training across transformer and hybrid architectures and improves post-training quantization (PTQ) performance across multiple PTQ methods, with AWQ+4/6 often yielding the best perplexities and task metrics. The contribution enables more robust training and deployment of NVFP4-quantized LLMs, and the code is publicly available for broader adoption and further research.

Abstract

As large language models have grown larger, low-precision numerical formats such as NVFP4 have become increasingly popular due to the speed and memory benefits they provide. However, to accelerate computation with NVFP4, all matrix multiplication operands--weights and activations in the forward pass, and weights, activations, and gradients in the backward pass--must be quantized to NVFP4, often leading to divergence during training and performance degradation during inference. To address this issue, in this work we introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. Unlike integer formats, floating-point formats such as FP4 have the most quantization error on near-maximal values in each block, which we find to be primarily responsible for downstream performance degradation. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform, improving representation of near-maximal values. Importantly, 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, making it viable to use while training LLMs with NVFP4. In pre-training experiments with transformer and hybrid model architectures, we find that 4/6 prevents divergence in several cases, bringing training loss significantly closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy. We hope this inspires future work in training and deploying models with NVFP4. Our code is available at http://github.com/mit-han-lab/fouroversix.
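The selection rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`round_to_fp4`, `quantize_block`, `four_over_six`) are hypothetical, the block scale is kept in full precision instead of being stored in a low-precision format, and ties in rounding are broken toward the smaller magnitude, which may differ from hardware rounding. It only shows the idea of quantizing a block against both candidate maxima and keeping the one with the lower mean squared error.

```python
# Non-negative magnitudes representable in FP4 (E2M1).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_fp4(x):
    """Round x to the nearest FP4 magnitude, keeping its sign.

    Assumption: ties go to the smaller magnitude (first match in the grid).
    """
    mag = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
    return -mag if x < 0 else mag


def quantize_block(block, target_max):
    """Scale the block so its largest magnitude maps to target_max,
    round every value to FP4, and return the dequantized values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / target_max  # simplification: scale kept in full precision
    return [round_to_fp4(v / scale) * scale for v in block]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def four_over_six(block):
    """Try both candidate maxima (6 and 4) and keep the lower-MSE result.

    Returns (chosen_max, dequantized_block, errors_by_candidate).
    On a tie the standard maximum of 6 is kept.
    """
    candidates = {m: quantize_block(block, m) for m in (6.0, 4.0)}
    errors = {m: mse(block, q) for m, q in candidates.items()}
    best = min(errors, key=errors.get)
    return best, candidates[best], errors


if __name__ == "__main__":
    # The second value is 75% of the block maximum, so scaling to 4 is exact.
    print(four_over_six([8.0, 6.0, 0.0, 0.0])[0])
    # Here every value already sits on the grid when the maximum maps to 6.
    print(four_over_six([6.0, 1.0, 0.5, 0.0])[0])
```

In the first example the value at 75% of the block maximum falls in the gap between 4 and 6 when the block is scaled to 6, so the candidate scaled to 4 wins; in the second, the standard scale is already exact and is kept.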

Paper Structure

This paper contains 17 sections, 1 equation, 5 figures, 8 tables.

Figures (5)

  • Figure 1: In standard NVFP4 quantization (left), using the full range of FP4 values from 0 to 6 means that it is impossible to represent values between 66.6% and 100% of the magnitude of the largest value in a block. By instead scaling some blocks to a maximum value of 4, it becomes possible to represent values that are 75% of the largest value in a block, reducing worst-case quantization error for large values.
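  • Figure 2: Simulated NVFP4 quantization with Llama-3.1-8B evaluated on WikiText-2 word perplexity. To improve NVFP4 performance, we find that we should focus on improving the representation of specific values in each block, namely the near-maximal values that the abstract identifies as primarily responsible for downstream degradation.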
  • Figure 3: Computational flow of an NVFP4 quantized linear layer trained with Four Over Six. All matrix multiplications (FPROP, DGRAD, WGRAD) are performed in NVFP4, while model weights are stored in FP32, and activations and gradients are stored in BF16. Q(4/6) denotes our method where blocks are scaled to either 4 or 6 based on the distribution of values. SR denotes Stochastic Rounding, and RHT denotes Random Hadamard Transform. Blue paths represent FP32 data flow; orange paths represent BF16; green paths represent NVFP4.
  • Figure 4: Four Over Six Can Mitigate Divergence During Pre-Training. Training loss curves comparing BF16, NVFP4, and NVFP4 with 4/6 for various model architectures and sizes. NVFP4 diverges in each case, forcing us to stop these runs early. Adding 4/6 keeps training loss closer to BF16 in all cases.
  • Figure 5: Training loss for our 340-million-parameter Transformer, including an ablation where 2D block scaling is not used.
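
The percentages in the Figure 1 caption follow directly from the FP4 value grid. As a small check, the Python sketch below lists the representable levels as fractions of the block maximum for each choice of scaled maximum. The helper names are hypothetical, and the grid assumes the standard FP4 (E2M1) magnitudes.

```python
# Non-negative magnitudes representable in FP4 (E2M1).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def relative_levels(target_max):
    """Representable magnitudes as fractions of the block maximum
    when the block maximum is scaled to target_max."""
    return [g / target_max for g in FP4_GRID if g <= target_max]


def top_gap(target_max):
    """Return (second-largest level, largest level) as fractions of the
    block maximum; nothing strictly between them can be represented."""
    levels = relative_levels(target_max)
    return levels[-2], levels[-1]


if __name__ == "__main__":
    for m in (6.0, 4.0):
        low, high = top_gap(m)
        print(f"max scaled to {m:g}: gap from {low:.1%} to {high:.1%}")
```

Scaling to 6 leaves nothing representable between 4/6 (about 66.7%) and 100% of the block maximum, while scaling to 4 places a level at 3/4 = 75%, at the cost of dropping the grid's top value.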