PQS (Prune, Quantize, and Sort): Low-Bitwidth Accumulation of Dot Products in Neural Network Computations

Vikas Natesh, H. T. Kung

TL;DR

PQS tackles overflow in low-bitwidth dot-product accumulation by combining $N:M$ pruning, uniform quantization to $b$ bits, and a Sorted Dot Product that orders partial products before accumulation to prevent transient overflows. It analyzes persistent versus transient overflows and shows that pruning in FP32 before quantization ($P\rightarrow Q$) yields better accuracy than quantizing first ($Q\rightarrow P$). For MobileNetV2 and ResNet-18 on CIFAR-10, PQS achieves a substantial reduction in accumulator width while preserving FP32-level accuracy, and demonstrates that sorting can eliminate most transient overflows. These results have practical implications for energy-efficient, high-throughput tinyML deployments and suggest avenues for hardware-aware optimization.
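
As a rough illustration of the $P\rightarrow Q$ ordering, the sketch below applies one-shot N:M magnitude pruning to floating-point weights and then uniform symmetric quantization. It is a simplified stand-in, not the authors' pipeline: the paper describes iterative pruning and quantization-aware training, whereas this sketch is post-training and single-pass. The helper names `prune_n_m` and `quantize_symmetric`, the convention that the N largest-magnitude weights of every M consecutive weights are kept, and the example weights are all assumptions made here.

```python
def prune_n_m(weights, n, m):
    """Keep the n largest-magnitude weights in each consecutive group of m.

    Assumed convention: N of every M weights survive; the rest become 0.0.
    """
    if len(weights) % m != 0:
        raise ValueError("len(weights) must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest magnitudes within this group.
        by_magnitude = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(by_magnitude[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def quantize_symmetric(weights, bits):
    """Uniform symmetric quantization to signed integers of width `bits`."""
    qmax = (1 << (bits - 1)) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


# Hypothetical FP32 weights: prune first (P), then quantize (Q).
fp_weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
pruned = prune_n_m(fp_weights, n=2, m=4)
q_weights, scale = quantize_symmetric(pruned, bits=8)
```

Pruned positions quantize to exactly zero, so they contribute no partial products; under this sketch, a 2:4 pattern halves the number of nonzero terms in each dot product.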

Abstract

We present PQS, which combines three techniques (Prune, Quantize, and Sort) to achieve low-bitwidth accumulation of dot products in neural network computations. In conventional quantized (e.g., 8-bit) dot products, partial results are accumulated into wide (e.g., 32-bit) accumulators to avoid overflow of intermediate partial sums. However, such wide accumulators increase memory bandwidth usage and reduce energy efficiency. We show that iterative N:M pruning in floating point, followed by quantization to 8 (or fewer) bits and accumulation of partial products in sorted order ("small to large"), allows for accurate, compressed models with short dot product lengths that do not require wide accumulators. We design, analyze, and implement the PQS algorithm to eliminate accumulation overflows at inference time for several neural networks. Our method offers a 2.5x reduction in accumulator bitwidth while achieving model accuracy on par with floating-point baselines for multiple image classification tasks.
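
To make the role of accumulation order concrete, here is a minimal sketch that classifies the overflow behaviour of a list of integer partial products in a narrow signed accumulator, and then reorders them. It assumes a persistent overflow means the final sum does not fit, while a transient overflow means only an intermediate sum leaves the range. The reordering shown (`sign_balanced_order`) is a simple sign-balancing heuristic chosen for illustration; it should not be taken as the paper's Sorted Dot Product algorithm, and all function names and values are hypothetical.

```python
def acc_range(bits):
    """Inclusive range of a signed two's-complement accumulator."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def classify_overflow(products, bits):
    """Return 'persistent', 'transient', or 'none' for this accumulation order."""
    lo, hi = acc_range(bits)
    if not lo <= sum(products) <= hi:
        return "persistent"  # the final result itself does not fit
    running = 0
    for p in products:
        running += p
        if not lo <= running <= hi:
            return "transient"  # only an intermediate sum leaves the range
    return "none"


def sign_balanced_order(products):
    """Reorder products so each step pulls the running sum back toward zero."""
    # Sorted so that pop() yields the smallest remaining magnitude of each sign.
    pos = sorted((p for p in products if p >= 0), reverse=True)
    neg = sorted(p for p in products if p < 0)
    ordered, running = [], 0
    while pos or neg:
        if (running >= 0 and neg) or not pos:
            nxt = neg.pop()
        else:
            nxt = pos.pop()
        ordered.append(nxt)
        running += nxt
    return ordered


# Hypothetical partial products accumulated in an 8-bit register [-128, 127].
products = [100, 90, -95, -80]
before = classify_overflow(products, bits=8)                       # "transient"
after = classify_overflow(sign_balanced_order(products), bits=8)   # "none"
```

In this example the naive order reaches 190 after two terms even though the final sum is 15, while the reordered sequence `[-80, 90, -95, 100]` never leaves the 8-bit range. Reordering cannot help with persistent overflows, since those depend only on the final sum.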

Paper Structure

This paper contains 15 sections, 6 equations, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the proposed PQS (Prune, Quantize, and Sort) framework for enabling low-bitwidth accumulation in quantized neural networks. After sparsifying a floating-point model via N:M pruning, we quantize the model using quantization-aware training on the remaining weights. During inference, we reduce transient accumulation overflows through our sorted dot product algorithm.
  • Figure 2: Profile of overflows during inference of a 1-layer MLP with 8-bit weights and activations trained on MNIST. Even though transient overflows account for only 3% of total overflows when using narrow 13-16 bit accumulators (a), resolving them improves accuracy from 10% to 40% (b), showing that for low-bitwidth accumulators, transient overflows can have an outsized impact on accuracy.
  • Figure 3: We compare accuracy of pruning before quantization (P->Q) with quantization before pruning (Q->P) under low-rank approximations of weight matrices in a two-layer MLP. In contrast to Q->P, P->Q models are more resilient to low-rank weight approximations and lose less accuracy as sparsity increases.
  • Figure 4: We compare P->Q and Q->P in MobileNetV2 (a) and ResNet-18 (b) on CIFAR-10. P->Q achieves up to 1.5% higher accuracy than Q->P while maintaining performance at higher sparsities. Structured filter pruning also performs poorly compared to N:M pruning in both P->Q and Q->P.
  • Figure 5: We visualize the trade-off between accumulator bitwidth and model accuracy. PQS (blue) can make use of accumulators with lower bitwidth than A2Q, without sacrificing significant model performance relative to the floating-point baseline. Sorting before accumulating the dot product allows us to avoid transient overflows and use up to 4 fewer bits in the accumulator than if we clipped those overflows (magenta lines).
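
For context on the accumulator-width trade-off in the captions above, the following sketch computes a data-independent worst-case bound: the smallest signed accumulator width that can never overflow for a dot product of length `k` with signed `bits`-bit weights and activations. This is a conservative back-of-the-envelope bound written for illustration, not the analysis used in the paper; the function name and the dot product lengths are assumptions, and trained networks typically need far fewer bits than this bound suggests.

```python
def worst_case_accumulator_bits(k, bits):
    """Smallest signed accumulator width that cannot overflow in the worst case.

    Assumes k products of two signed `bits`-bit integers, each of magnitude
    at most (2**(bits - 1)) ** 2.
    """
    max_product = (1 << (bits - 1)) ** 2  # e.g. (-128) * (-128) for 8 bits
    bound = k * max_product               # largest possible |sum|
    width = 1
    # Grow the width until its largest positive value covers the bound.
    while (1 << (width - 1)) - 1 < bound:
        width += 1
    return width


dense_bits = worst_case_accumulator_bits(k=512, bits=8)   # hypothetical dense length
sparse_bits = worst_case_accumulator_bits(k=256, bits=8)  # same layer with half the terms
```

Under this bound, halving the number of nonzero terms (as a 2:4 pattern would) saves only one bit, from 25 to 24 for the hypothetical lengths above. Larger savings have to come from the actual distribution of weights and activations and from avoiding transient overflows, which is the regime the captions describe.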