MGS: Markov Greedy Sums for Accurate Low-Bitwidth Floating-Point Accumulation

Vikas Natesh, H. T. Kung, David Kong

TL;DR

MGS addresses the swamping and overflow challenges in low-bitwidth neural network accumulation by modeling dot-product partial sums as a Markov process and grouping mantissas by exponent. It combines a narrow accumulator for the majority of sums with a wide accumulator that is used only on overflow, and it provides a dual-accumulator MAC (dMAC) design for both integer and FP8 computations. Empirical results show that MGS achieves FP32-equivalent accuracy on several image-classification tasks while reducing average accumulator width, and hardware evaluations indicate substantial energy savings (up to 34.1%) in FP8 and INT8 implementations. The approach enables practical, retraining-free deployment of low-bitwidth DNNs with improved numerical stability and lower power, suitable for both FPGA prototyping and 7 nm ASIC deployment.
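The narrow-plus-wide accumulation summarized above can be sketched for plain integers. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name `dual_accumulate`, the flush policy (when the next add would leave the narrow range, move the narrow sum into the wide accumulator and restart the narrow accumulator from the incoming value), and the sample inputs are all hypothetical. The narrow range [-15, 15] follows the Figure 1 caption.

```python
def dual_accumulate(values, narrow_bits=5):
    """Sum integers with a narrow accumulator, spilling to a wide one on overflow.

    Hypothetical sketch: `narrow_bits=5` gives the symmetric range [-15, 15].
    Returns (total, flushes), where `flushes` counts uses of the wide accumulator.
    """
    hi = 2 ** (narrow_bits - 1) - 1
    lo = -hi
    narrow = 0   # small, cheap accumulator used for most additions
    wide = 0     # Python ints are unbounded, so the wide width is not modeled
    flushes = 0
    for v in values:
        if not lo <= v <= hi:
            raise ValueError(f"{v} does not fit in the narrow range [{lo}, {hi}]")
        if lo <= narrow + v <= hi:
            narrow += v          # common case: stay in the narrow accumulator
        else:
            wide += narrow       # the add would overflow: spill the partial sum
            narrow = v           # restart the narrow sum from the incoming value
            flushes += 1
    return wide + narrow, flushes


# Hypothetical inputs chosen so that 15 + 2 > 15 and -9 - 7 < -15 both occur.
total, flushes = dual_accumulate([7, 8, 2, -11, -7, 3])
```

With these inputs only two of the six additions touch the wide accumulator, and the returned total equals the exact integer sum.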

Abstract

We offer a novel approach, MGS (Markov Greedy Sums), to improve the accuracy of low-bitwidth floating-point dot products in neural network computations. In conventional 32-bit floating-point summation, adding values with different exponents may lead to loss of precision in the mantissa of the smaller term, which is right-shifted to align with the larger term's exponent. Such shifting (a.k.a. 'swamping') is a significant source of numerical errors in accumulation when implementing low-bitwidth dot products (e.g., 8-bit floating point) as the mantissa has a small number of bits. We avoid most swamping errors by arranging the terms in dot product summation based on their exponents and summing the mantissas without overflowing the low-bitwidth accumulator. We design, analyze, and implement the algorithm to minimize 8-bit floating point error at inference time for several neural networks. In contrast to traditional sequential summation, our method has significantly lowered numerical errors, achieving classification accuracy on par with high-precision floating-point baselines for multiple image classification tasks. Our dMAC hardware units can reduce power consumption by up to 34.1% relative to conventional MAC units.
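To make the swamping effect concrete, the sketch below compares sequential accumulation with a truncated 4-bit significand against summing integer mantissas grouped by exponent. It is an illustration under stated assumptions, not the paper's algorithm: the helper names are hypothetical, truncation toward zero stands in for bits shifted out during alignment, the exponent range is unlimited, and Python's unbounded integers mean accumulator overflow is not modeled. The two input values are the ones quoted in the Figure 2 caption.

```python
import math
from collections import defaultdict

MANT_BITS = 4  # assumed significand width: 1 implicit + 3 stored bits, as in E4M3


def truncate_to_mantissa(x, bits=MANT_BITS):
    """Keep `bits` significant binary digits of x, dropping the rest (toward zero)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                      # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(math.trunc(m * (1 << bits)), e - bits)


def sequential_sum(values):
    """Left-to-right accumulation, truncating the running sum after every add."""
    acc = 0.0
    for v in values:
        acc = truncate_to_mantissa(acc + v)
    return acc


def exponent_grouped_sum(values, bits=MANT_BITS):
    """Sum integer mantissas per exponent, then align once per group."""
    groups = defaultdict(int)
    for v in values:
        if v == 0.0:
            continue
        m, e = math.frexp(v)
        groups[e] += int(m * (1 << bits))     # exact when v already fits in `bits`
    return sum(math.ldexp(total, e - bits) for e, total in groups.items())


a, b = -0.25, -0.029296875                    # the values named in Figure 2
values = [a] + [b] * 8
swamped = sequential_sum(values)              # every copy of b is shifted out
grouped = exponent_grouped_sum(values)        # one alignment per exponent group
```

In the sequential version each small term falls below the last kept bit of the running sum and is dropped, while the grouped version only aligns exponents once per group, after the mantissas have been added as integers.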

Paper Structure

This paper contains 22 sections, 1 theorem, 7 equations, 10 figures, 3 tables.

Key Result

theorem 1

Let $X = \{x_1, x_2, ..., x_k\}$ be a list of $k$ signed integers, where each $x_i$ is represented using $n$ bits. Let $y = \sum_{i=1}^{k} x_i$ be the sum of all elements in $X$, representable using $m \geq n + 1$ bits without persistent overflow (i.e., $-2^{m-1} \leq y \leq 2^{m-1} - 1$). Then, the

Figures (10)

  • Figure 1: Example of Markov Greedy Sums (MGS). In (a), we sum 12 integers into a narrow accumulator (green box) until the sum $s_i$ overflows the range [-15, 15]. Then, we accumulate $s_i$ into a wider accumulator (red box). The underlined red values are those that would have caused an overflow of the narrow accumulator, noting that 15 + 2 > 15 and -9 - 7 < -15. In (b), we accumulate FP8 values by separating them by their exponents into 16 groups, summing the mantissas into separate narrow accumulators indexed by the exponent, and using a wide accumulator upon overflow. MGS amortizes the cost of aligning (shifting) FP8 mantissas over many sums.
  • Figure 2: An example of mantissa bit swamping when adding two E4M3 values with different exponents, $A = -0.25$ and $B = -0.029297$, while using a narrow 4-bit accumulator. The exponent bias in E4M3 is 7. A's exponent of 5 is larger than B's exponent of 1 (b), causing B's mantissa to be shifted right by 5 - 1 = 4 bits (c). Since the entire mantissa shifts out, B is treated as zero, and the final result is -0.25, differing from the closest FP8 result of -0.28125 (d).
  • Figure 3: Percentage error, relative to FP32 precision, of Gaussian vector dot products performed in FP8 precision. We execute each algorithm using solely a narrow accumulator and clip partial sums upon overflow. All algorithms exhibit significant errors due to the swamping of lower-order bits when using reduced-precision accumulators. MGS has lower error than pairwise summation because it separates partial product mantissas by exponent and accumulates them in separate narrow accumulators, so its dot product errors result only from clipping overflows. However, the $\approx 35\%$ error of MGS, when restricted to a narrow accumulator, is unacceptable for DNN applications.
  • Figure 4: (a) We estimate the probability of overflow, based on the model described in Section \ref{sec:prob}, when performing dot products at different accumulator bitwidths. 5-bit Gaussian weights in the range [-15, 15] are multiplied with 7-bit Gaussian activations in [-63, 63] to yield partial products $Z \approx N(0, k * \sigma_w \sigma_x)$. We set $\sigma$ of weights and data such that the extreme values lie three standard deviations away from the mean of 0, i.e., $\sigma_w = 15/3 = 5$ and $\sigma_x = 63/3 = 21$. The figure shows that despite 7 + 5 = 12-bit partial products, we can use accumulators with < 12 bits for most sums before overflow. For example, there is only a $\approx$ 12% chance of overflow when summing 10 elements in a narrow 10-bit accumulator. In (b), we plot the average accumulator bitwidth when running MobileNetv2 inference with 5-bit weights and 7-bit activations. Although one would expect that at least 5 + 7 = 12 bits are required to prevent overflow, the average accumulator bitwidth required varies between 7 and 10 bits.
  • Figure 5: Empirically measured average dot product length versus the expected dot product length under our random walk model. 5-bit weights follow a normal distribution in the range [-15, 15], while 7-bit activations have a half-normal distribution in the range [0, 127] after ReLU. The plot shows that with an accumulation bitwidth of 10, we do not expect overflow until a summation length of about 32. In contrast, a naive analysis would conclude that 17 = 5 + 7 + 5 bits are required to avoid overflows, noting that $5 = \log_2{32}$.
  • ...and 5 more figures
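
The overflow estimate quoted in the Figure 4 caption can be approximately reproduced with a short calculation. The sketch below is our own reading rather than the paper's model: it assumes independent zero-mean weights and activations, treats the sum of $k$ products as Gaussian with standard deviation $\sqrt{k}\,\sigma_w \sigma_x$, and only asks whether the final sum leaves the symmetric accumulator range. The function name and parameters are hypothetical.

```python
import math


def overflow_probability(k, acc_bits, sigma_w=5.0, sigma_x=21.0):
    """Approximate P(|sum of k products| > accumulator limit).

    Assumptions (ours): independent zero-mean terms, a Gaussian approximation
    for the sum, and a symmetric signed range of +/-(2**(acc_bits - 1) - 1).
    Only the final sum is checked, so overflows of intermediate partial sums
    are not counted.
    """
    limit = 2 ** (acc_bits - 1) - 1
    sigma_sum = math.sqrt(k) * sigma_w * sigma_x
    # Two-sided Gaussian tail: P(|Z| > limit) = erfc(limit / (sigma * sqrt(2)))
    return math.erfc(limit / (sigma_sum * math.sqrt(2.0)))


# sigma_w = 15/3 and sigma_x = 63/3 are the values given in the caption.
p = overflow_probability(k=10, acc_bits=10)
```

Under these assumptions, summing 10 products into a 10-bit accumulator gives a probability of roughly 0.12, in line with the figure quoted in the caption; widening the accumulator or shortening the sum lowers it.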

Theorems & Definitions (3)

  • definition 1: Transient Overflow
  • definition 2: Persistent Overflow
  • theorem 1