PLUM: Improving Inference Efficiency By Leveraging Repetition-Sparsity Trade-Off
Sachit Kuhar, Yash Jain, Alexey Tumanov
TL;DR
This work tackles the inefficiency of DNN inference on edge devices due to data movement and limited compute by introducing the repetition-sparsity trade-off. It proposes PLUM, a co-design framework that combines two signed-binary quantization functions with region-based local binarization to exploit this trade-off across DNN blocks, guided by backpropagation with an adapted gradient mechanism. Empirically, PLUM pushes the Pareto frontier on CIFAR-10 and ImageNet, achieving a 26% hardware speedup, roughly 2x energy efficiency, and a 2.8x reduction in density compared with binary baselines, while preserving top-1 accuracy (e.g., 66.2% on ImageNet for ResNet). The approach yields a practical, scalable path toward deploying accurate, efficient models in resource-constrained environments and provides insights into the latent-distribution dynamics under signed-binary co-design.
Abstract
Efficient inference of Deep Neural Networks (DNNs) on resource-constrained edge devices is essential. Quantization and sparsity are key techniques that translate to repetition and sparsity within tensors at the hardware-software interface. This paper introduces the concept of the repetition-sparsity trade-off, which helps explain computational efficiency during inference. We propose PLUM, a unified co-design framework that integrates DNN inference systems and quantization (forward and backward pass) to leverage the repetition-sparsity trade-off and improve inference efficiency. Our results demonstrate that PLUM's quantization method is more accurate than binary quantization with the same number of non-zero weights. Detailed analysis indicates that signed binarization generates a smaller distribution of effectual (non-zero) parameters nested within a larger distribution of total parameters of latent full-precision weights for a DNN block. Finally, the proposed PLUM framework achieves a 26% speedup on real hardware, doubles energy efficiency, and reduces density by 2.8x compared to binary methods, while retaining top-1 accuracy competitive with prior-art methods for ResNets on ImageNet (66.2% top-1 accuracy), presenting an alternative solution for deploying efficient models in resource-limited environments.
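
To make the idea of signed binarization concrete, the following is a minimal NumPy sketch, not the PLUM implementation. It assumes that a "region" is one filter (the first axis of the weight tensor), that each region is assigned one of two quantization functions with value sets {0, +1} or {0, -1}, and that a simple magnitude cutoff decides which weights become zero. The names `signed_binary_quantize`, `density`, `signs`, and `threshold` are hypothetical and chosen only for illustration.

```python
import numpy as np


def signed_binary_quantize(latent, signs, threshold=0.0):
    """Quantize each region (here: each filter) to {0, +1} or {0, -1}.

    latent:    full-precision latent weights, shape (num_filters, ...)
    signs:     one entry per filter, +1 or -1, selecting the value set
    threshold: hypothetical cutoff; weights not exceeding it in the
               filter's signed direction are set to zero
    """
    latent = np.asarray(latent, dtype=np.float64)
    # Reshape signs so they broadcast over all non-filter axes.
    signs = np.asarray(signs).reshape((-1,) + (1,) * (latent.ndim - 1))
    pos = (latent > threshold).astype(np.int8)    # values in {0, +1}
    neg = -(latent < -threshold).astype(np.int8)  # values in {0, -1}
    return np.where(signs > 0, pos, neg)


def density(quantized):
    """Fraction of effectual (non-zero) weights."""
    return np.count_nonzero(quantized) / quantized.size
```

Under these assumptions, every filter still holds only two distinct values, so repetition within a filter is preserved as in binary quantization, while a share of the weights becomes zero and can be skipped. For example, with latent weights `[[0.5, -0.2, 0.1, -0.7], [0.3, -0.4, -0.1, 0.9]]` and `signs = [1, -1]`, the first filter maps to `[1, 0, 1, 0]` and the second to `[0, -1, -1, 0]`, giving a density of 0.5, whereas a plain binary {-1, +1} quantizer would have a density of 1.0.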
