Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference
Tomer Gafni, Asaf Karnieli, Yair Hanani
TL;DR
This work tackles the challenge of deploying large DNNs by proposing a hardware-friendly post-training quantization approach that combines 4-bit weight storage with FP8 computation (W4A8). The core method, Dual Precision Quantization (DPQ), quantizes weights to INT4 offline and performs arithmetic in FP8 online; activations are quantized to FP8 and outputs are kept in BF16 to reduce error. Group-Aware Reordering (GAR) further mitigates accuracy loss through Hessian-guided, constrained weight reordering that preserves inference efficiency. The paper reports substantial throughput gains across language and vision tasks on Llama and Qwen-VL models, with DPQ matching or outperforming existing W4A8 schemes and closely tracking full-precision baselines. The framework is modular and compatible with other quantization advances, offering a practical path to efficient, accurate inference on modern accelerators.
Abstract
Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to grow, posing challenges in latency and memory efficiency. To meet these constraints, post-training quantization has emerged as a promising solution. In this paper, we propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation. Specifically, we introduce a W4A8 scheme in which weights are quantized and stored in 4-bit integer precision, while inference computations are performed in 8-bit floating-point arithmetic. Compared to 16-bit operations, this scheme yields significant speedups and improved memory utilization, and it is applicable to a variety of modern accelerators. To mitigate accuracy loss, we develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead. Experimental results demonstrate improved performance (i.e., increased throughput) while maintaining tolerable accuracy degradation relative to the full-precision model.
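To make the W4A8 data flow concrete, the following is a minimal NumPy sketch. It is not the DPQ or GAR algorithm; it only illustrates the storage and compute precisions named above under several assumptions: weights are quantized with plain symmetric round-to-nearest INT4 using one scale per group of input channels, FP8 is emulated in software with E4M3-style rounding (3 mantissa bits, maximum magnitude 448), activations are assumed to already fit in that range, and accumulation is done in float64 as a stand-in for the higher-precision (BF16) output. All function names and the group size are hypothetical.

```python
import numpy as np

FP8_MAX = 448.0    # largest finite magnitude of the assumed E4M3 format
FP8_MIN_EXP = -6   # exponent of the smallest normal E4M3 value


def round_to_fp8_e4m3(x):
    """Emulate FP8 (E4M3-style) rounding in software: 3 mantissa bits."""
    x = np.clip(np.asarray(x, dtype=np.float64), -FP8_MAX, FP8_MAX)
    _, exp = np.frexp(np.abs(x))             # |x| = m * 2**exp, m in [0.5, 1)
    exp = np.maximum(exp - 1, FP8_MIN_EXP)   # floor(log2|x|), clamped for subnormals
    step = np.ldexp(1.0, exp - 3)            # spacing of representable values
    return np.round(x / step) * step


def quantize_int4_groupwise(w, group_size):
    """Offline step (assumed): symmetric round-to-nearest INT4 per group."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    groups = w.reshape(out_features, in_features // group_size, group_size)
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    scale = np.maximum(max_abs, 1e-12) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale                          # 4-bit codes held in int8 containers


def dequantize(q, scale):
    """Map INT4 codes back to real values, shape (out_features, in_features)."""
    return (q * scale).reshape(q.shape[0], -1)


def w4a8_linear(x, q, scale):
    """Online step (assumed): weights and activations in FP8, wide accumulation."""
    w_fp8 = round_to_fp8_e4m3(dequantize(q, scale))
    x_fp8 = round_to_fp8_e4m3(x)
    return x_fp8 @ w_fp8.T                   # float64 here; BF16 output in the paper


rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64))                # toy weight matrix
x = rng.normal(size=(4, 64))                 # toy activations
q, scale = quantize_int4_groupwise(w, group_size=32)
y_ref = x @ w.T
y_q = w4a8_linear(x, q, scale)
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print(f"relative output error: {rel_err:.3f}")
```

The sketch separates what is stored (INT4 codes plus per-group scales) from what is computed (FP8 operands with a wider accumulator), which is the split the abstract describes. A real deployment would pack two 4-bit codes per byte and use hardware FP8 matrix units rather than software rounding.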
