MGS: Markov Greedy Sums for Accurate Low-Bitwidth Floating-Point Accumulation

Vikas Natesh; H. T. Kung; David Kong

TL;DR

MGS addresses the swamping and overflow challenges in low-bitwidth neural network accumulation by modeling dot-product partial sums as a Markov process and grouping mantissas by exponent. It combines a narrow accumulator for the majority of sums with a wide accumulator that is used only on overflow, and it provides a dual-accumulator MAC (dMAC) design for both integer and FP8 computations. Empirical results show that MGS achieves FP32-equivalent accuracy on several image-classification tasks while reducing average accumulator width, and hardware evaluations indicate substantial energy savings (up to 34.1%) in FP8 and INT8 implementations. The approach enables practical, retraining-free deployment of low-bitwidth DNNs with improved numerical stability and lower power, suitable for both FPGA prototyping and 7 nm ASIC deployment.
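The narrow-plus-wide accumulator idea can be illustrated with a minimal integer sketch. It is not the paper's dMAC design: the function name `dual_accumulate`, the 8-bit default width, and the policy of flushing the narrow register into the wide one on overflow are all assumptions made for illustration.

```python
def dual_accumulate(products, narrow_bits=8):
    """Sum integers in a narrow signed accumulator, spilling to a wide one on overflow.

    Hypothetical sketch: the flush-on-overflow policy is an assumption,
    not the dMAC circuit described in the paper.
    """
    lo = -(1 << (narrow_bits - 1))
    hi = (1 << (narrow_bits - 1)) - 1
    narrow = 0   # small register used for most additions
    wide = 0     # large register touched only when the narrow one would overflow
    spills = 0   # number of times the wide accumulator was used
    for p in products:
        t = narrow + p
        if lo <= t <= hi:
            narrow = t          # common case: result fits in the narrow register
        else:
            wide += t           # rare case: move the running value to the wide register
            narrow = 0
            spills += 1
    return wide + narrow, spills


total, spills = dual_accumulate([100, 100, -50, 30, -120, 7])
print(total, spills)
```

In this toy run the wide accumulator is used for only two of the six additions, which is the behavior that makes the average accumulator width smaller than the worst-case width.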

Abstract

We offer a novel approach, MGS (Markov Greedy Sums), to improve the accuracy of low-bitwidth floating-point dot products in neural network computations. In conventional 32-bit floating-point summation, adding values with different exponents may lose precision in the mantissa of the smaller term, which is right-shifted to align with the larger term's exponent. Such shifting (a.k.a. 'swamping') is a significant source of numerical error in accumulation when implementing low-bitwidth dot products (e.g., 8-bit floating point), because the mantissa has only a few bits. We avoid most swamping errors by arranging the terms of the dot-product summation by exponent and summing the mantissas without overflowing the low-bitwidth accumulator. We design, analyze, and implement the algorithm to minimize 8-bit floating-point error at inference time for several neural networks. Compared with traditional sequential summation, our method yields significantly lower numerical error, achieving classification accuracy on par with high-precision floating-point baselines on multiple image classification tasks. Our dual-accumulator MAC (dMAC) hardware units can reduce power consumption by up to 34.1% relative to conventional MAC units.
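The swamping effect and the benefit of grouping terms by exponent can be seen in a small simulation. This is a sketch under stated assumptions, not the MGS algorithm itself: the helper `round_to_mantissa` models a float with 3 explicit mantissa bits and unbounded exponent range, the inputs are assumed to be exactly representable in that format, and the per-exponent totals are combined in double precision.

```python
import math
from collections import defaultdict


def round_to_mantissa(x, bits=3):
    """Round x to `bits` explicit mantissa bits (hypothetical format, no exponent limits)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << (bits + 1)         # implicit leading bit plus `bits` fraction bits
    return math.ldexp(round(m * scale) / scale, e)


def sequential_sum(values, bits=3):
    """Accumulate left to right, rounding after every addition."""
    acc = 0.0
    for v in values:
        acc = round_to_mantissa(acc + v, bits)
    return acc


def exponent_grouped_sum(values, bits=3):
    """Add mantissas as integers within each exponent group, then combine the groups."""
    scale = 1 << (bits + 1)
    groups = defaultdict(int)
    for v in values:
        if v == 0.0:
            continue
        m, e = math.frexp(v)
        groups[e] += round(m * scale)   # exact integer when v is representable
    # Assumption: the final combination across exponents is done at high precision.
    return sum(math.ldexp(total / scale, e) for e, total in groups.items())


values = [16.0] + [1.0] * 16
print(sequential_sum(values), exponent_grouped_sum(values), sum(values))
```

With this input, each addition of `1.0` to the running value `16.0` is rounded away in the sequential sum, whereas the sixteen equal-exponent terms add exactly as integers in the grouped sum.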
