FAAR: Format-Aware Adaptive Rounding for NVFP4

Hanglin Li, Shuchang Tian, Chen Lin, Zhiyong Zhao, Kun Zhan

Abstract

Deploying large language models (LLMs) on edge devices requires extremely low-bit quantization. Ultra-low-precision formats such as NVFP4 offer a promising way to reduce memory footprint and accelerate computation. However, existing quantization methods typically rely on conventional rounding strategies and do not account for the non-uniformity of the NVFP4 numerical grid, which leads to suboptimal rounding decisions and amplified quantization errors. To address this, we propose Format-Aware Adaptive Rounding (FAAR), a learnable rounding strategy tailored to the NVFP4 format. Unlike conventional quantization paradigms, FAAR explicitly incorporates the non-uniform NVFP4 grid into the optimization process. By adaptively adjusting rounding decisions under the guidance of loss gradients, our method approximates the theoretically optimal quantization. To complement FAAR, we introduce a 2-stage Format Alignment (2FA) fine-tuning scheme that aligns LLM parameters layer by layer to the NVFP4 numerical space, further narrowing the performance gap. This learnable optimization incurs a training overhead of only 4 GPU hours on Llama3-1B. Compared with Round-to-Nearest (RTN), our method reduces perplexity on WikiText-2 from 14.28 to 12.60 on Llama3-1B and from 23.06 to 21.27 on Qwen3-1.7B. Additionally, our method consistently outperforms state-of-the-art approaches across various zero-shot downstream tasks.
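To make the Round-to-Nearest baseline and the non-uniform grid concrete, the sketch below rounds values onto the FP4 E2M1 magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6) that NVFP4 elements use. It is an illustrative simplification, not the paper's implementation: the function names `rtn_e2m1` and `rtn_block` are hypothetical, the block size of 16 follows the public description of NVFP4, and the per-block scale is kept in full precision rather than being stored in FP8 as the real format does.

```python
import numpy as np

# Non-negative magnitudes representable by FP4 E2M1; the sign is a separate bit.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def rtn_e2m1(x):
    """Round each element to the nearest E2M1 value.

    Values beyond +/-6 saturate at 6. Exact ties go to the smaller
    magnitude here (an arbitrary choice made for this sketch).
    """
    x = np.asarray(x, dtype=np.float64)
    mag = np.abs(x)
    # Distance from every magnitude to every grid point; pick the closest.
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]


def rtn_block(w, block_size=16):
    """Block-scaled RTN: map each block's max magnitude to 6, round, rescale.

    Assumption: the scale stays in float64; a real NVFP4 kernel would
    also quantize the scale itself.
    """
    w = np.asarray(w, dtype=np.float64).reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero
    return (rtn_e2m1(w / scale) * scale).reshape(-1)


if __name__ == "__main__":
    x = np.array([0.2, 0.3, 2.4, 2.6, 5.1, -4.9])
    q = rtn_e2m1(x)
    print(q)              # quantized values
    print(np.abs(q - x))  # error grows where the grid is coarse
```

The worst-case error depends on which gap a value falls into: at most 0.25 between 0 and 0.5, but up to 1.0 between 4 and 6. A rounding rule that treats all positions alike cannot compensate for this magnitude dependence.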

Paper Structure (29 sections, 7 equations, 2 figures, 8 tables)

Figures (2)

  • Figure 1: The overall pipeline of our proposed quantization framework combining FAAR and 2FA. (1) FAAR + Stage 1: The continuous learnable rounding variables $V$ are carefully initialized and incorporated into the layer-wise NVFP4 optimization. During this stage, we perform layer-wise rounding optimization against the frozen BF16 model. For each optimization step, only the current layer is updated while the rest of the network remains frozen. The rounding decisions are optimized via learnable variables parameterized by a differentiable sigmoid function and guided by Round Loss. (2) FAAR + Stage 2: The quantized layers are assembled into a full NVFP4 model and jointly optimized. Model-wise alignment is performed using KL divergence and last-hidden-state MSE losses to mitigate full-model error accumulation. (3) Hardening and Inference: After the two-stage optimization, the continuous variables $V$ are deterministically hardened into discrete binary decisions, which are then deployed for efficient inference.
  • Figure 2: The non-uniform NVFP4 grid introduces magnitude-dependent rounding errors. (a) The mapping function relative to the original weight $w$, showing numerical nodes that are densely concentrated near zero but become increasingly sparse for larger magnitudes. (b) The absolute quantization error, highlighting the low-error region (shaded green) optimized for the core weight distribution and the amplified distortion for weights with larger magnitudes.
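The sigmoid-parameterized rounding and the final hardening step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names (`neighbors`, `init_v`, `soft_quant`, `hard_quant`) are hypothetical, the initialization that reproduces the original weight is an assumed choice, the inputs are taken to be already divided by their block scale, and the Round Loss is omitted because this excerpt does not define it.

```python
import numpy as np

# Non-negative FP4 E2M1 magnitudes; spacing widens from 0.5 to 2.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def neighbors(mag):
    """Grid points just below and just above each magnitude in [0, 6]."""
    mag = np.clip(mag, 0.0, E2M1_GRID[-1])
    hi_idx = np.searchsorted(E2M1_GRID, mag, side="right")
    hi_idx = np.clip(hi_idx, 1, len(E2M1_GRID) - 1)
    return E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def init_v(w, eps=1e-4):
    """Assumed init: choose V so the soft-quantized value equals w."""
    mag = np.abs(np.asarray(w, dtype=np.float64))
    lo, hi = neighbors(mag)
    # Fractional position inside the local gap, independent of gap width.
    p = np.clip((np.clip(mag, 0.0, 6.0) - lo) / (hi - lo), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))  # inverse sigmoid


def soft_quant(w, v):
    """Differentiable interpolation between the two neighboring grid points."""
    w = np.asarray(w, dtype=np.float64)
    lo, hi = neighbors(np.abs(w))
    return np.sign(w) * (lo + sigmoid(v) * (hi - lo))


def hard_quant(w, v):
    """Hardening: round up where sigmoid(V) > 0.5, i.e. V > 0, else down."""
    w = np.asarray(w, dtype=np.float64)
    lo, hi = neighbors(np.abs(w))
    return np.sign(w) * np.where(v > 0, hi, lo)


if __name__ == "__main__":
    w = np.array([0.2, 2.4, -4.9, 5.1])
    v = init_v(w)
    print(soft_quant(w, v))  # approximately equal to w at initialization
    print(hard_quant(w, v))  # matches round-to-nearest before any training
    v[1] = 1.0               # pretend training pushed this variable upward
    print(hard_quant(w, v))  # 2.4 now rounds up to 3.0 instead of down to 2.0
```

Because the interpolation uses the actual neighboring grid points, a given change in $V$ moves the weight further where the grid is sparse, which is what makes the parameterization format-aware. In a training setup $V$ would be updated by gradient descent in an autodiff framework; plain NumPy is used here only to show the forward computation.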