Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference

Tomer Gafni, Asaf Karnieli, Yair Hanani

TL;DR

This work tackles the challenge of deploying large DNNs by proposing a hardware-friendly post-training quantization approach that combines 4-bit weight storage with FP8 computation (W4A8). The core methodology, Dual Precision Quantization (DPQ), performs offline INT4 weight quantization and online FP8 arithmetic, while activations are quantized to FP8 and outputs kept in BF16 to reduce error; Group-Aware Reordering (GAR) further mitigates accuracy loss by Hessian-guided, constrained weight reordering that preserves inference efficiency. The paper reports substantial throughput gains across language and vision tasks on Llama and Qwen-VL models, with DPQ outperforming or matching existing W4A8 schemes and closely tracking full-precision baselines. The proposed framework is modular and compatible with other quantization advances, offering a practical pathway to efficient, accurate inference on modern accelerators.

Abstract

Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to grow, posing challenges in latency and memory efficiency. To meet these constraints, post-training quantization has emerged as a promising solution. In this paper, we propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation. Specifically, we introduce a W4A8 scheme, where weights are quantized and stored using 4-bit integer precision, and inference computations are performed using 8-bit floating-point arithmetic, demonstrating significant speedups and improved memory utilization compared to 16-bit operations, applicable on various modern accelerators. To mitigate accuracy loss, we develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead. Experimental results demonstrate improved performance (i.e., increased throughput) while maintaining tolerable accuracy degradation relative to the full-precision model.
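As a rough illustration of the 4-bit integer weight storage mentioned in the abstract, the sketch below quantizes a flat list of weights group by group, with one scale and one zero-point per group. It is a generic asymmetric scheme written for illustration, not the paper's DPQ algorithm: the function names, the rounding rule, and the example values are assumptions, and the FP8 step of the W4A8 flow is left out entirely.

```python
# Hypothetical sketch of per-group asymmetric INT4 quantization.
# Names, rounding rule, and example values are illustrative assumptions,
# not taken from the paper.

QMAX = 15  # unsigned 4-bit codes span 0..15


def quantize_group(weights):
    """Quantize one group to 4-bit codes with a shared scale and zero-point."""
    lo = min(min(weights), 0.0)  # keep 0.0 representable
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / QMAX or 1.0  # fall back to 1.0 for an all-zero group
    zero_point = max(0, min(QMAX, round(-lo / scale)))
    codes = [max(0, min(QMAX, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map 4-bit codes back to approximate real values."""
    return [(c - zero_point) * scale for c in codes]


def quantize_per_group(weights, group_size):
    """Split a flat weight list into consecutive groups and quantize each."""
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    return [
        quantize_group(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


# Three groups of two weights each.
weights = [-1.0, 0.5, 0.0, 3.0, -0.2, -0.1]
quantized = quantize_per_group(weights, group_size=2)
restored = [
    value
    for codes, scale, zero_point in quantized
    for value in dequantize_group(codes, scale, zero_point)
]
```

Each group carries its own scale, so a large weight in one group (3.0 here) does not coarsen the quantization grid of the others.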

Paper Structure

This paper contains 20 sections, 14 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: An illustration of the GAR method. In this example, the weight tensor is divided into three groups, indicated by the red superscript. The black arrows represent the local ordering within each group, based on the diagonal elements of the Hessian for that group, while the red arrows denote the global ordering between groups, determined by the maximum diagonal Hessian element of each group. At the end of the quantization process, the tensor is re-permuted to its original order. Notably, scales operate on consecutive weights after the ordering process (as opposed to the activation reordering done in GPTQ).
  • Figure 2: High-level illustration of the proposed DPQ algorithm. We begin with hybrid reordering (GAR) of the weights based on their importance. Next, we apply two quantization levels: FP8 and INT4. The quantization errors from both processes are computed and distributed to the next, less 'important' weights. Finally, we re-permute the tensor and the scales back to their original order.
  • Figure 3: An illustration of the inference flow of our proposed W4A8 scheme.
  • Figure 4: Speed-up comparison between three methods on Gaudi 2 and Gaudi 3 accelerators. Each subplot shows the speed-up of W4A8 (red) and W8A8 (blue) relative to W4A16 (green). The two upper plots show Llama 2 70B on Gaudi 2 (left) and Gaudi 3 (right). The bottom plots show Qwen 2 72B on Gaudi 2 (left) and Gaudi 3 (right). As can be seen, W4A8 can reach up to 3x speed-up over W4A16, and up to 1.4x speed-up over W8A8. W4A8 uses GAR, thus weights are consecutive in memory during inference.
  • Figure 5: An illustration of per-group scale and zero-point. In this example, the weight tensor is divided into three groups, each containing two weights (i.e., a group size of 2).
  • ...and 2 more figures
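
The ordering described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration that assumes a flat list of Hessian diagonal values and assumes that "more important" means a larger diagonal entry, processed first. The function names and example values are hypothetical, and the quantization and error-propagation steps of DPQ are not shown.

```python
# Hypothetical sketch of the group-aware ordering in the Figure 1 caption.
# Assumption: larger Hessian diagonal entries are processed first.


def gar_permutation(hessian_diag, group_size):
    """Return a processing order that never mixes indices across groups."""
    n = len(hessian_diag)
    if n % group_size != 0:
        raise ValueError("len(hessian_diag) must be a multiple of group_size")
    # Groups are consecutive index ranges in the original weight layout.
    groups = [list(range(s, s + group_size)) for s in range(0, n, group_size)]
    # Local ordering: sort indices inside each group by descending diagonal.
    groups = [sorted(g, key=lambda i: -hessian_diag[i]) for g in groups]
    # Global ordering: sort whole groups by their largest diagonal entry,
    # which is the first index of each group after the local sort.
    groups.sort(key=lambda g: -hessian_diag[g[0]])
    return [i for g in groups for i in g]


def inverse_permutation(perm):
    """Return the permutation that restores the original order."""
    inv = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inv[old_pos] = new_pos
    return inv


# Three groups of two weights each.
hessian_diag = [0.2, 0.9, 0.5, 0.1, 0.7, 0.3]
weights = [10, 11, 12, 13, 14, 15]

perm = gar_permutation(hessian_diag, group_size=2)
reordered = [weights[i] for i in perm]
inv = inverse_permutation(perm)
restored = [reordered[inv[i]] for i in range(len(weights))]
```

Because every group stays a consecutive index range of the original tensor, restoring the original order leaves each per-group scale attached to a contiguous block of weights, which is the property the caption contrasts with activation reordering in GPTQ.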