DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference

Jiajun Zhou, Jiajun Wu, Yizhao Gao, Yuhao Ding, Chaofan Tao, Boyu Li, Fengbin Tu, Kwang-Ting Cheng, Hayden Kwok-Hay So, Ngai Wong

TL;DR

This paper tackles the challenge of accurate quantization for neural networks at very low bitwidths by introducing DyBit, a variable-length exponent representation that adapts to weight and activation distributions. It combines a hardware-efficient, run-time configurable mixed-precision accelerator with a hardware-aware quantization framework that optimizes layer-wise bitwidth under latency and RMSE constraints. DyBit demonstrates near-FP accuracy at 8 bits and substantial end-to-end speedups (up to $8.1\times$) across models like ResNet and MobileNetV2, outperforming state-of-the-art methods at 4 bits by nearly $2\%$ Top-1 accuracy. The approach offers a practical pathway to deploy low-bitwidth DNN inference with high efficiency on real hardware, while supporting multiple models via a cycle-accurate simulator-guided search.

Abstract

To accelerate the inference of deep neural networks (DNNs), quantization with low-bitwidth numbers is actively researched. A prominent challenge is to quantize DNN models into low-bitwidth numbers without significant accuracy degradation, especially at very low bitwidths (< 8 bits). This work targets an adaptive data representation with variable-length encoding called DyBit. DyBit can dynamically adjust the precision and range of its separate bit-fields to adapt to the distribution of DNN weights and activations. We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy against speedup. Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization, and the proposed framework can achieve up to $8.1\times$ speedup compared with the original model.
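
The RMSE that the framework uses as a constraint can be illustrated with a small, generic sketch. The code below is not DyBit's encoding (this summary does not specify its bit-field layout); it assumes a plain symmetric uniform grid as a stand-in, and the helper names `uniform_codebook`, `nearest`, `quantize`, and `rmse` are hypothetical. It only shows how quantization error against a given set of representable values might be measured at 8 and 4 bits.

```python
import math
import random
from bisect import bisect_left


def uniform_codebook(bits, max_abs):
    """Symmetric uniform grid with 2**bits - 1 levels in [-max_abs, max_abs].

    Assumption: a stand-in format, NOT the DyBit representation.
    """
    half = 2 ** (bits - 1) - 1          # e.g. 7 positive levels for 4 bits
    step = max_abs / half
    return [i * step for i in range(-half, half + 1)]


def nearest(codebook, x):
    """Return the entry of a sorted codebook closest to x (clamping at the ends)."""
    i = bisect_left(codebook, x)
    if i == 0:
        return codebook[0]
    if i == len(codebook):
        return codebook[-1]
    lo, hi = codebook[i - 1], codebook[i]
    return lo if x - lo <= hi - x else hi


def quantize(values, codebook):
    """Map every value to its nearest representable level."""
    return [nearest(codebook, v) for v in values]


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic, roughly bell-shaped "weights"; real layers would be used instead.
    weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
    max_abs = max(abs(w) for w in weights)
    for bits in (8, 4):
        q = quantize(weights, uniform_codebook(bits, max_abs))
        print(bits, "bits -> RMSE", round(rmse(weights, q), 4))
```

Because `quantize` takes any sorted list of levels, a non-uniform format could be evaluated the same way by supplying its representable values in place of the uniform grid.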

Paper Structure

This paper contains 23 sections, 4 equations, 6 figures, 3 tables, and 1 algorithm.

Figures (6)

  • Figure 1: An illustration of different $N$-bit numerical arithmetic formats including FP, Posits and DyBit numbers.
  • Figure 2: The diagram of the proposed DyBit quantization.
  • Figure 3: Mixed-precision hardware system based on the proposed DyBit representation. (a) Hardware architecture based on the systolic array, (b) mixed-precision decoder (MP Decoder), and (c) mixed-precision mantissa multiplier (MAN. MUL).
  • Figure 4: DyBit-Based hardware-aware quantization framework.
  • Figure 5: Speedup and accuracy evaluations on the speedup-constrained strategy (the first row) and the RMSE-constrained strategy (the second row), based on MobileNetV2 and ResNet18/50 models. The target platform is Xilinx ZCU102.
  • ...and 1 more figure