Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators

Yaniv Blumenfeld; Itay Hubara; Daniel Soudry

TL;DR

This work presents a simple method to train and fine-tune high-end DNNs so that, for the first time, they can use cheaper $12$-bit accumulators with no significant degradation in accuracy.

Abstract

Most research on the quantization of Deep Neural Networks (DNNs) focuses on reducing the precision of tensors visible to high-level frameworks (e.g., weights, activations, and gradients). However, current hardware still relies on high-precision core operations, the most significant being the accumulation of products. This high-precision accumulation is gradually becoming the main computational bottleneck because, so far, using low-precision accumulators has led to a significant degradation in performance. In this work, we present a simple method to train and fine-tune high-end DNNs that allows, for the first time, the use of cheaper $12$-bit accumulators with no significant degradation in accuracy. Lastly, we show that as the accumulation precision is decreased further, fine-grained gradient approximations can improve DNN accuracy.

Paper Structure (23 sections, 17 equations, 2 figures, 10 tables)

Introduction
Preliminaries: Quantized Neural Networks
Quantized weights and activations
Fixed point Quantization
Floating point Quantization
Low Bit-Width Accumulators
Fine-tuning Neural Networks with Low-Bit Accumulators
Experiments: Image Classification
Experiments: Language Models
Below $12$ bits: Fine-grained Gradient for Low Bit Accumulators
Discussion
General Matrix Multiplication: Example
Effect of quantized FMA on zero-shot accuracy
Experiments Implementation Details
ImageNet
...and 8 more sections

Figures (2)

Figure 1: Left: an illustration of the quantized FMA component, as simulated in our work. Unlike the W/A quantization operations ($Q_W(w),Q_A(x)$), which can be performed efficiently in software, $Q_\text{prod}$ and $Q_\text{acc}$ are explicitly internal hardware operations, intended to simulate the logic of a cheaper hardware component. Right: an illustration of chunk-based accumulation, with a chunk size of $n$. Chunk-based accumulation is useful for reducing the error caused by swamping, but the chunk size is not easily configured and will usually depend on the architecture design of the systolic array.
Figure 2: Wide-scope loss landscapes of an LBA ResNet50, using pre-trained ResNet50 weights (CIFAR10, FP32). Here, we compare the qualitative effect of different components of floating-point quantization on the network output. In (a), we use a complete implementation of FP quantization during convolution accumulation, with $7$ mantissa bits and $4$ exponent bits. In (b), we repeat the previous experiment but ignore underflow events during quantization. For comparison, in (c), we repeat the original experiment but add $16$ additional bits to the mantissa, greatly diminishing the effect of swamping without affecting the role of underflow. All landscapes appear similar, but while the effect of excluding swamping events (c) is visible, the loss landscapes of networks with (a) and without (b) underflow are hardly distinguishable.
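
The swamping and underflow effects named in the two captions can be illustrated with a small simulation. The sketch below is not the authors' implementation: the function names `quantize_fp` and `accumulate`, the exponent bias of 7, round-to-nearest, saturation at the largest value, and the absence of subnormals are all assumptions chosen for illustration. It rounds the running sum to a toy format with $7$ mantissa bits and $4$ exponent bits after every addition, and optionally accumulates in chunks.

```python
import math


def quantize_fp(x, mantissa_bits=7, exponent_bits=4, bias=7, underflow=True):
    """Round x to a toy floating-point format (assumed layout, no subnormals)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    e_min = -bias                                # smallest representable exponent
    e_max = (2 ** exponent_bits - 1) - bias      # largest representable exponent
    exp = math.frexp(mag)[1] - 1                 # floor(log2(mag))
    if exp < e_min and underflow:
        return 0.0                               # underflow: flush to zero
    # With underflow=False the exponent is left unbounded below.
    step = 2.0 ** (exp - mantissa_bits)          # spacing of representable values
    q = round(mag / step) * step                 # round to nearest
    max_val = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** e_max
    return sign * min(q, max_val)                # saturate on overflow


def accumulate(values, chunk_size=None, **fmt):
    """Sum values, quantizing the accumulator after every addition.

    If chunk_size is given, each chunk is summed separately and the
    partial sums are then accumulated (chunk-based accumulation).
    """
    if chunk_size is None:
        acc = 0.0
        for v in values:
            acc = quantize_fp(acc + v, **fmt)
        return acc
    partials = [
        accumulate(values[i:i + chunk_size], **fmt)
        for i in range(0, len(values), chunk_size)
    ]
    return accumulate(partials, **fmt)


if __name__ == "__main__":
    # One large term followed by many small ones: the small terms are
    # below half the accumulator's spacing once it holds 4.0.
    values = [4.0] + [2.0 ** -7] * 64
    print("exact sum:      ", sum(values))
    print("sequential:     ", accumulate(values))
    print("chunked (n=16): ", accumulate(values, chunk_size=16))
    print("underflow on:   ", quantize_fp(2.0 ** -9))
    print("underflow off:  ", quantize_fp(2.0 ** -9, underflow=False))
```

With these inputs the exact sum is 4.5, sequential accumulation stays at 4.0 because every small term is swamped, and chunked accumulation with chunks of 16 recovers 4.375. Passing a larger `mantissa_bits` shrinks the spacing and reduces swamping, while `underflow=False` only changes how values below the smallest exponent are treated.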
