PQS (Prune, Quantize, and Sort): Low-Bitwidth Accumulation of Dot Products in Neural Network Computations
Vikas Natesh, H. T. Kung
TL;DR
PQS tackles overflow in low-bitwidth dot-product accumulation by combining $N:M$ pruning, uniform quantization to $b$ bits, and a Sorted Dot Product that orders partial products to suppress transient overflows. It distinguishes persistent from transient overflows and shows that pruning in FP32 before quantization ($P\rightarrow Q$) yields better accuracy than quantizing first ($Q\rightarrow P$). For MobileNetV2 and ResNet-18 on CIFAR-10, PQS reduces accumulator bitwidth by 2.5x while preserving FP32-level accuracy, and sorting eliminates most transient overflows. These results matter for energy-efficient, high-throughput tinyML deployments and suggest avenues for hardware-aware optimization.
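The difference between the two overflow types can be illustrated with a small sketch. It assumes that a transient overflow is one where some partial sum leaves the accumulator range although the final sum fits, and that a persistent overflow is one where the final sum itself does not fit. The function names and the sign-balancing order below are hypothetical illustrations; the paper's Sorted Dot Product may order terms differently.

```python
def count_overflows(terms, bits):
    """Count partial sums that fall outside a signed `bits`-wide accumulator.

    The running total is kept as an exact Python int; the accumulator range
    is only checked, not emulated with wraparound.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    total, overflows = 0, 0
    for t in terms:
        total += t
        if not lo <= total <= hi:
            overflows += 1
    return overflows, total


def sign_balanced_order(terms):
    """Hypothetical ordering (an assumption, not the paper's algorithm).

    Non-negative and negative terms are each sorted from small to large
    magnitude, then interleaved so the running sum is pulled back toward
    zero whenever terms of the opposite sign remain.
    """
    pos = sorted(t for t in terms if t >= 0)
    neg = sorted((t for t in terms if t < 0), reverse=True)
    ordered, total = [], 0
    i = j = 0
    while i < len(pos) and j < len(neg):
        if total >= 0:
            nxt = neg[j]
            j += 1
        else:
            nxt = pos[i]
            i += 1
        ordered.append(nxt)
        total += nxt
    # One sign is exhausted: the rest moves monotonically to the final sum.
    ordered.extend(pos[i:])
    ordered.extend(neg[j:])
    return ordered


products = [100, 90, -100, -90]  # made-up partial products
naive = count_overflows(products, bits=8)
balanced = count_overflows(sign_balanced_order(products), bits=8)
print("naive order   :", naive)
print("balanced order:", balanced)
```

In this toy case the final sum is 0, so the overflow seen in the naive order is transient and disappears after reordering. An input such as `[100, 100]` overflows an 8-bit accumulator in any order, because the final sum itself is out of range.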
Abstract
We present PQS, which combines three techniques (Prune, Quantize, and Sort) to achieve low-bitwidth accumulation of dot products in neural network computations. In conventional quantized (e.g., 8-bit) dot products, partial results are accumulated into wide (e.g., 32-bit) accumulators so that intermediate partial sums do not overflow. However, such wide accumulators increase memory bandwidth usage and reduce energy efficiency. We show that iterative N:M pruning in floating point, followed by quantization to 8 (or fewer) bits and accumulation of partial products in sorted order ("small to large"), allows for accurate, compressed models with short dot product lengths that do not require wide accumulators. We design, analyze, and implement the PQS algorithm to eliminate accumulation overflows at inference time for several neural networks. Our method offers a 2.5x reduction in accumulator bitwidth while achieving model accuracy on par with floating-point baselines for multiple image classification tasks.
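A minimal sketch of the prune-then-quantize order described above, under stated assumptions: N:M pruning is taken to mean keeping the N largest-magnitude weights in every group of M, the pruning is done in one shot rather than iteratively during training, and quantization is symmetric and per-tensor. All function names are hypothetical, and the accumulator-width bound is a generic worst-case calculation for illustration, not the paper's analysis.

```python
def prune_n_of_m(weights, n, m):
    """Keep the n largest-magnitude weights in each group of m; zero the rest.

    Assumption: "N:M" means n kept values per m consecutive weights.
    """
    pruned = list(weights)
    for start in range(0, len(weights), m):
        group = range(start, min(start + m, len(weights)))
        keep = sorted(group, key=lambda i: abs(weights[i]), reverse=True)[:n]
        for i in group:
            if i not in keep:
                pruned[i] = 0.0
    return pruned


def quantize_symmetric(values, bits):
    """Uniform symmetric quantization to signed `bits`-bit integers."""
    qmax = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    return [int(round(v / scale)) for v in values], scale


def worst_case_accumulator_bits(nonzero_terms, bits):
    """Signed width that can never overflow for `nonzero_terms` products.

    Each product of two symmetric `bits`-bit values has magnitude at most
    qmax * qmax, so the sum is bounded by nonzero_terms * qmax * qmax.
    """
    qmax = (1 << (bits - 1)) - 1
    bound = nonzero_terms * qmax * qmax
    return bound.bit_length() + 1  # magnitude bits plus one sign bit


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.6, 0.01]  # made-up FP32 weights
pruned = prune_n_of_m(weights, n=2, m=4)       # prune in floating point first
q_weights, scale = quantize_symmetric(pruned, bits=8)  # then quantize
print(pruned)
print(q_weights)
print(worst_case_accumulator_bits(len(weights), 8),
      worst_case_accumulator_bits(sum(q != 0 for q in q_weights), 8))
```

Pruning shortens the effective dot product (here from 8 terms to 4), which lowers the worst-case accumulator width. The worst case is a conservative bound, since it assumes every product takes its maximum magnitude with the same sign.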
