HiFloat4 Format for Language Model Inference

Yuanyong Luo; Jing Huang; Yu Cheng; Ziwei Yu; Kaihua Zhang; Kehong Hong; Xinda Ma; Xin Wang; Anping Tong; Guipeng Hu; Yun Xu; Mehran Taghian; Peng Wu; Guanglin Li; Yunke Peng; Tianchi Hu; Minqi Chen; Michael Bi Mi; Hu Liu; Xiping Zhou; Junsong Wang; Qiang Lin; Heng Liao

TL;DR

This work introduces HiF4, a 4-bit block floating-point (BFP) format designed for efficient LLM inference. It uses a 64-element group with a three-level scaling hierarchy (an E6M2 base scale plus E1_8 and E1_16 micro-exponents) stored in 32 bits of shared metadata, which averages $4.5$ bits per value and allows dot products to be computed largely in fixed point. The paper provides a BF16-to-HiF4 conversion, a dedicated post-training quantization (PTQ) method called HiGPTQ, and comparisons against NVFP4 across multiple models. These show improved quantization accuracy and hardware efficiency, including a ~$24\%$ reduction in MSE relative to NVFP4 on Gaussian data and a ~$10\%$ power saving for 64-length dot products. Experiments on small and large LLMs (LLaMA, Qwen, Mistral, DeepSeek-V3.1, LongCat) indicate higher inference accuracy and robustness for HiF4, with some cases surpassing the BF16 baselines. Overall, HiF4 is presented as a practical 4-bit BFP format that balances numerical precision against hardware cost for large-scale language-model inference.

Abstract

This paper introduces HiFloat4 (HiF4), a block floating-point data format tailored for deep learning. Each HiF4 unit packs 64 4-bit elements with 32 bits of shared scaling metadata, averaging 4.5 bits per value. The metadata specifies a three-level scaling hierarchy, capturing inter- and intra-group dynamic range while improving the utilization of the representational space. In addition, the large 64-element group size enables matrix multiplications to be executed in a highly fixed-point manner, significantly reducing hardware area and power consumption. To evaluate the proposed format, we conducted inference experiments on several language models, including LLaMA, Qwen, Mistral, DeepSeek-V3.1 and LongCat. Results show that HiF4 achieves higher average accuracy than the state-of-the-art NVFP4 format across multiple models and diverse downstream tasks.
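The storage figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the split of the 32 metadata bits (8 bits for the E6M2 base scale, eight 1-bit E1_8 exponents, sixteen 1-bit E1_16 exponents) is an assumed reading of the format names, not something this summary states explicitly.

```python
# Storage arithmetic for one HiF4 group, using the figures from the abstract.
ELEMENTS_PER_GROUP = 64   # elements per HiF4 unit (from the abstract)
BITS_PER_ELEMENT = 4      # each element is 4 bits (from the abstract)
METADATA_BITS = 32        # shared scaling metadata (from the abstract)

# ASSUMPTION: one plausible split of the 32 metadata bits, inferred from
# the names E6M2, E1_8 and E1_16. Not confirmed by this summary.
ASSUMED_METADATA_SPLIT = {
    "E6M2 base scale": 8,         # 6 exponent + 2 mantissa bits
    "E1_8 micro-exponents": 8,    # eight 1-bit exponents
    "E1_16 micro-exponents": 16,  # sixteen 1-bit exponents
}


def bits_per_value(n_elements: int, element_bits: int, metadata_bits: int) -> float:
    """Average storage cost per element, including shared metadata."""
    return (n_elements * element_bits + metadata_bits) / n_elements


def packed_bytes(num_values: int) -> int:
    """Bytes needed for num_values elements, rounding up to whole groups."""
    groups = -(-num_values // ELEMENTS_PER_GROUP)  # ceiling division
    group_bits = ELEMENTS_PER_GROUP * BITS_PER_ELEMENT + METADATA_BITS
    return groups * group_bits // 8


if __name__ == "__main__":
    print(bits_per_value(ELEMENTS_PER_GROUP, BITS_PER_ELEMENT, METADATA_BITS))
    print(sum(ASSUMED_METADATA_SPLIT.values()) == METADATA_BITS)
    print(packed_bytes(4096), "bytes in HiF4 vs", 4096 * 2, "bytes in BF16")
```

Each group occupies 288 bits (36 bytes), so a 4096-element row takes 2304 bytes, compared with 8192 bytes in BF16.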

Paper Structure (15 sections, 3 equations, 4 figures, 5 tables, 1 algorithm)

Introduction
HiFloat4
Format Definition
Scaling Metadata
Element Encoding
Represented Values
Format Conversion
Quantization Error and Dot Product
Quantization Error Evaluation
Dot Product Evaluation
Language Model Inference with HiFloat4
Post-Training Quantization for LLMs
Experiments on Small LLMs
Experiments on DeepSeek-V3.1 and LongCat
Conclusion

Figures (4)

Figure 1: The Structure of Three 4-bit Block Floating-Point Formats
Figure 2: The Structure of HiF4 Block Floating-Point Format
Figure 3: Quantization Error Comparison of 4-bit BFP Formats
Figure 4: Compute Flow of 64-length Dot Product for HiF4 and NVFP4
