HiFloat4 Format for Language Model Inference
Yuanyong Luo, Jing Huang, Yu Cheng, Ziwei Yu, Kaihua Zhang, Kehong Hong, Xinda Ma, Xin Wang, Anping Tong, Guipeng Hu, Yun Xu, Mehran Taghian, Peng Wu, Guanglin Li, Yunke Peng, Tianchi Hu, Minqi Chen, Michael Bi Mi, Hu Liu, Xiping Zhou, Junsong Wang, Qiang Lin, Heng Liao
TL;DR
This work introduces HiF4, a 4-bit block floating-point (BFP) format designed for efficient LLM inference. Each group holds 64 elements plus 32 bits of shared metadata that encode a three-level scaling hierarchy (an E6M2 base scale plus E1_8 and E1_16 micro-exponents), for an average of $4.5$ bits per value, and the format allows dot products to be computed largely in fixed point. The paper provides a BF16-to-HiF4 conversion procedure, a dedicated PTQ method called HiGPTQ, and comprehensive comparisons against NVFP4 across multiple models. These show improved quantization accuracy and hardware efficiency, including a ~$24\%$ reduction in MSE relative to NVFP4 on Gaussian data and a ~$10\%$ power saving for 64-length dot products. Experiments on small and large LLMs (LLaMA, Qwen, Mistral, DeepSeek-V3.1, LongCat) demonstrate HiF4's superior inference accuracy and robustness, in some cases even surpassing the BF16 baselines. Overall, HiF4 is a practical 4-bit BFP solution that balances numerical precision against hardware cost for large-scale language-model inference.
Abstract
This paper introduces HiFloat4 (HiF4), a block floating-point data format tailored for deep learning. Each HiF4 unit packs 64 four-bit elements with 32 bits of shared scaling metadata, averaging 4.5 bits per value. The metadata specifies a three-level scaling hierarchy that captures inter- and intra-group dynamic range while improving the utilization of the representational space. In addition, the large 64-element group size allows matrix multiplications to be executed largely in fixed-point arithmetic, significantly reducing hardware area and power consumption. To evaluate the proposed format, we conducted inference experiments on several language models, including LLaMA, Qwen, Mistral, DeepSeek-V3.1, and LongCat. Results show that HiF4 achieves higher average accuracy than the state-of-the-art NVFP4 format across multiple models and diverse downstream tasks.
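The 4.5 bits-per-value figure follows directly from the unit layout stated above: 64 elements of 4 bits each, plus 32 bits of metadata shared by the whole group. The following is a minimal sketch of that arithmetic only. The helper name `bits_per_value` and its parameters are hypothetical, introduced here for illustration, and the sketch says nothing about how the metadata bits are actually laid out or decoded.

```python
def bits_per_value(group_size: int, element_bits: int, metadata_bits: int) -> float:
    """Average storage cost per element when metadata is shared by the group.

    Hypothetical helper for illustration; not part of any HiF4 library.
    """
    total_bits = group_size * element_bits + metadata_bits
    return total_bits / group_size


# Numbers taken from the abstract: 64 four-bit elements + 32 shared metadata bits.
hif4_unit_bits = 64 * 4 + 32          # total bits in one HiF4 unit
hif4_unit_bytes = hif4_unit_bits // 8  # the same unit expressed in bytes
hif4_avg = bits_per_value(group_size=64, element_bits=4, metadata_bits=32)

print(hif4_unit_bits, hif4_unit_bytes, hif4_avg)  # 288 36 4.5
```

The shared metadata therefore adds half a bit per element on top of the 4-bit payload. For a fixed metadata budget, a larger group would amortize that overhead further.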
