Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs

Shivam Aggarwal; Hans Jakob Damsgaard; Alessandro Pappalardo; Giuseppe Franco; Thomas B. Preußer; Michaela Blott; Tulika Mitra

TL;DR

The paper tackles the challenge of efficient neural network inference on FPGAs by enabling minifloat post-training quantization across 3–8 bit widths for weights and activations. It develops a two-level minifloat quantization flow and a custom FPGA MAC library, and adapts established PTQ techniques (SmoothQuant, Bias Correction, Learned Rounding, GPTQ) to the minifloat setting. Through experiments on ResNet-18, MobileNetV2, and ViT-B-32 with ImageNet, it shows minifloats can match or surpass integer quantization in accuracy at 4–8 bits, while reducing memory footprint, albeit with nuanced LUT trade-offs on FPGA hardware. The findings highlight meaningful memory-cost benefits for transformer workloads and demonstrate a hardware-aware design space where per-layer format choices and PTQ techniques interact to shape accuracy and resource utilization.

Abstract

Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits and their relative comparison in terms of accuracy-hardware cost with integers remain unexplored on FPGAs. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integer-based quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers.
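
To make the notion of a minifloat concrete, the following is a minimal sketch of round-to-nearest quantization onto a sign/exponent/mantissa grid such as E2M1 (the format named in Figure 1). It is not the paper's implementation. The function names `minifloat_grid` and `quantize_minifloat`, the IEEE-style default exponent bias, the use of subnormals, the absence of reserved inf/NaN encodings, and the single per-tensor `scale` are all assumptions made for illustration.

```python
import numpy as np

def minifloat_grid(exp_bits, man_bits, bias=None):
    """Return the sorted non-negative values representable with the given
    exponent and mantissa widths (hypothetical helper, not from the paper)."""
    if bias is None:
        # Assumption: IEEE-style default bias.
        bias = 2 ** (exp_bits - 1) - 1
    values = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                # Subnormal range: no implicit leading one.
                values.append(frac * 2.0 ** (1 - bias))
            else:
                # Normal range. Assumption: the top exponent encodes ordinary
                # values, so no codes are spent on inf/NaN.
                values.append((1.0 + frac) * 2.0 ** (e - bias))
    return np.array(sorted(set(values)))

def quantize_minifloat(x, exp_bits, man_bits, scale=1.0):
    """Scale, saturate, and snap each element to the nearest grid value.
    Ties resolve to the smaller magnitude (argmin picks the first match),
    which differs from round-to-nearest-even."""
    grid = minifloat_grid(exp_bits, man_bits)
    x = np.asarray(x, dtype=np.float64) / scale
    mag = np.clip(np.abs(x), 0.0, grid[-1])              # saturate at the max value
    idx = np.abs(mag[..., None] - grid).argmin(axis=-1)  # nearest grid point
    return np.sign(x) * grid[idx] * scale

# E2M1 magnitudes under these assumptions: 0, 0.5, 1, 1.5, 2, 3, 4, 6
print(minifloat_grid(2, 1))
print(quantize_minifloat([0.3, 1.2, 2.4, 5.1, -7.0], exp_bits=2, man_bits=1))
```

Unlike a 4-bit integer grid, whose levels are evenly spaced, the E2M1 levels in this sketch are denser near zero and sparser toward the maximum, which is the property that distinguishes minifloat from integer quantization at equal bit width.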

Paper Structure (18 sections, 10 equations, 4 figures, 2 tables)

Introduction
Related Work
Background: Integer Quantization
Minifloat Quantization
PTQ Optimization for Minifloats
SmoothQuant
Bias Correction
Gradient-based Learned Rounding
GPTQ
FPGA Operator Library
Experimental Setup
Evaluation & Discussion
Impact on Model Accuracy
Optimal Minifloat Formats
Impact on Memory Footprint
...and 3 more sections

Figures (4)

Figure 1: PTQ process with INT4 representation and E2M1 representation.
Figure 2: Simplified illustrations of the considered integer and minifloat MACs with two operands, $\textbf{a}$ and $\textbf{b}$.
Figure 3: Trade-off analysis between model accuracy (%) and memory footprint for integers (INT) and minifloats (FP).
Figure 4: Trade-off analysis between model accuracy (%) and LUT utilization for integers (INT) and minifloats (FP). Points where the gap between the two formats closes are labeled with their corresponding bit-width configurations as (W$\times$A).
