DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference

Jiajun Zhou; Jiajun Wu; Yizhao Gao; Yuhao Ding; Chaofan Tao; Boyu Li; Fengbin Tu; Kwang-Ting Cheng; Hayden Kwok-Hay So; Ngai Wong

TL;DR

This paper tackles the challenge of accurate neural-network quantization at very low bitwidths by introducing DyBit, a variable-length exponent representation that adapts to weight and activation distributions. It pairs a hardware-efficient, run-time configurable mixed-precision accelerator with a hardware-aware quantization framework that selects layer-wise bitwidths under latency and RMSE constraints, with the search guided by a cycle-accurate simulator. DyBit reaches near-floating-point (FP) accuracy at 8 bits and end-to-end speedups of up to $8.1\times$ on models such as ResNet and MobileNetV2, and at 4 bits it outperforms state-of-the-art methods by nearly $2\%$ in Top-1 accuracy. The approach offers a practical path to deploying low-bitwidth DNN inference efficiently on real hardware.

Abstract

To accelerate the inference of deep neural networks (DNNs), quantization with low-bitwidth numbers is actively researched. A prominent challenge is to quantize DNN models into low-bitwidth numbers without significant accuracy degradation, especially at very low bitwidths (< 8 bits). This work targets an adaptive data representation with variable-length encoding called DyBit. DyBit can dynamically adjust the precision and range of separate bit-fields to adapt to the distribution of DNN weights and activations. We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy against speedup. Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization, and the proposed framework can achieve up to $8.1\times$ speedup compared with the original model.
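The trade-off between bitwidth and quantization error that the framework navigates can be illustrated with a minimal sketch. This is not the DyBit format or the paper's search algorithm: it assumes a plain symmetric uniform quantizer as a stand-in, and the function names, layer names (`conv1`, `fc`), candidate bitwidths, and RMSE budget are all hypothetical choices made for illustration. It only shows the general idea of picking, per layer, the smallest bitwidth whose quantization RMSE stays within a budget.

```python
import math
import random


def quantize_uniform(values, bits):
    """Symmetric uniform quantizer (a stand-in, NOT the DyBit encoding)."""
    levels = 2 ** (bits - 1) - 1              # positive levels, e.g. 7 at 4 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / levels                  # step size between adjacent levels
    return [round(v / scale) * scale for v in values]


def rmse(reference, approx):
    """Root-mean-square error between two equal-length sequences."""
    total = sum((x - y) ** 2 for x, y in zip(reference, approx))
    return math.sqrt(total / len(reference))


def pick_bitwidth(values, candidates, rmse_budget):
    """Return the smallest candidate bitwidth whose RMSE fits the budget."""
    for bits in sorted(candidates):
        if rmse(values, quantize_uniform(values, bits)) <= rmse_budget:
            return bits
    return max(candidates)                    # fall back to the widest candidate


# Hypothetical "layers": synthetic Gaussian weights with different spreads.
random.seed(0)
layers = {
    "conv1": [random.gauss(0.0, 1.0) for _ in range(1000)],
    "fc": [random.gauss(0.0, 0.1) for _ in range(1000)],
}
chosen = {
    name: pick_bitwidth(weights, (2, 4, 8), rmse_budget=0.05)
    for name, weights in layers.items()
}
print(chosen)
```

Under a fixed absolute RMSE budget, the layer with the wider weight spread needs more bits than the narrow one. A hardware-aware search such as the one described in the abstract would additionally weigh each choice against latency on the target accelerator, which this sketch does not model.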

Paper Structure (23 sections, 4 equations, 6 figures, 3 tables, 1 algorithm)

Introduction
BACKGROUND AND RELATED WORK
Quantization Method
Mixed-Precision Hardware Accelerator
METHODOLOGY
Variable-Length Datatype
Hardware Design
Architecture
Decoder & Encoder
Mixed-precision PE
Hardware-aware Quantization Framework
Quantization Metrics
Two Search Strategies
Quantization Search Flow
Hardware Simulator
...and 8 more sections

Figures (6)

Figure 1: An illustration of different $N$-bit numerical arithmetic formats including FP, Posits and DyBit numbers.
Figure 2: The diagram of the proposed DyBit quantization.
Figure 3: Mixed-precision hardware system based on the proposed DyBit representation. (a) Hardware architecture based on the systolic array, (b) mixed-precision decoder (MP Decoder), and (c) mixed-precision mantissa multiplier (MAN. MUL).
Figure 4: DyBit-Based hardware-aware quantization framework.
Figure 5: Speedup and accuracy evaluations on the speedup-constrained strategy (the first row) and the RMSE-constrained strategy (the second row), based on MobileNetV2 and ResNet18/50 models. The target platform is Xilinx ZCU102.
...and 1 more figure
