PLUM: Improving Inference Efficiency By Leveraging Repetition-Sparsity Trade-Off

Sachit Kuhar, Yash Jain, Alexey Tumanov

TL;DR

This work tackles the inefficiency of DNN inference on edge devices due to data movement and limited compute by introducing the repetition-sparsity trade-off. It proposes PLUM, a co-design framework that combines two signed-binary quantization functions with region-based local binarization to exploit this trade-off across DNN blocks, guided by backpropagation with an adapted gradient mechanism. Empirically, PLUM pushes the Pareto frontier on CIFAR-10 and ImageNet, achieving a 26% hardware speedup, roughly 2x energy efficiency, and a 2.8x reduction in density compared with binary baselines, while preserving top-1 accuracy (e.g., 66.2% on ImageNet for ResNet). The approach yields a practical, scalable path toward deploying accurate, efficient models in resource-constrained environments and provides insights into the latent-distribution dynamics under signed-binary co-design.

Abstract

Efficient inference of Deep Neural Networks (DNNs) on resource-constrained edge devices is essential. Quantization and sparsity are key techniques that translate to repetition and sparsity within tensors at the hardware-software interface. This paper introduces the concept of repetition-sparsity trade-off that helps explain computational efficiency during inference. We propose PLUM, a unified co-design framework that integrates DNN inference systems and quantization (forward and backward pass) to leverage the repetition-sparsity trade-off to improve inference efficiency. Our results demonstrate that PLUM's quantization method is more accurate than binary quantization with the same number of non-zero weights. Detailed analysis indicates that signed binarization generates a smaller distribution of effectual (non-zero) parameters nested within a larger distribution of total parameters of latent full-precision weights for a DNN block. Finally, the proposed PLUM framework achieves a 26% speedup on real hardware, doubles energy efficiency, and reduces density by 2.8x compared to binary methods while retaining top-1 accuracy when compared to prior-art methods for ResNets on ImageNet (by achieving 66.2% top-1 accuracy), presenting an alternative solution for deploying efficient models in resource-limited environments.
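
To make the idea of signed binarization concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each region (here, one filter) is assigned a single sign and that its weights are quantized to {0, +1} or {0, -1} with a fixed magnitude threshold. The function names, the threshold value, and the toy weights are all hypothetical; the paper's actual quantization functions and gradient handling may differ.

```python
def binary_quantize(weights):
    """Sign-based binarization: every weight becomes +1 or -1, so no zeros."""
    return [1 if w >= 0 else -1 for w in weights]


def signed_binary_quantize(weights, sign, threshold):
    """Quantize one region to {0, +1} (sign=+1) or {0, -1} (sign=-1).

    A weight is kept only if it points in the region's sign direction and
    its magnitude reaches `threshold` (an assumed hyperparameter).
    """
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    return [sign if sign * w >= threshold else 0 for w in weights]


def density(quantized):
    """Fraction of effectual (non-zero) weights."""
    return sum(1 for v in quantized if v != 0) / len(quantized)


def unique_values(quantized):
    """Number of distinct values, a rough proxy for weight repetition."""
    return len(set(quantized))


if __name__ == "__main__":
    # Toy latent full-precision weights for two filters (made-up numbers).
    filters = [[0.8, -0.3, 0.1, 0.6], [-0.7, 0.2, -0.9, -0.05]]
    signs = [1, -1]  # assumed per-filter sign assignment
    for w, s in zip(filters, signs):
        b = binary_quantize(w)
        sb = signed_binary_quantize(w, s, threshold=0.5)
        print("binary       ", b, density(b), unique_values(b))
        print("signed-binary", sb, density(sb), unique_values(sb))
```

In this toy setting each signed-binary filter still holds at most two distinct values, as a binary filter does, but part of its weights are zero, which is the combination of repetition and sparsity the abstract refers to.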

Paper Structure (46 sections, 4 equations, 11 figures, 13 tables)

This paper contains 46 sections, 4 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: On the left: the conventional, isolated approach, in which DNN inference systems and quantization methods are designed separately; this ignores the repetition-sparsity trade-off and leads to inefficient inference. On the right: PLUM, a unified design framework, performs quantization-system co-design to exploit the repetition-sparsity trade-off, thereby enhancing computational efficiency.
  • Figure 2: PLUM vs. Prior-Art: For ResNet on ImageNet, the pronounced spread indicates that PLUM holistically outperforms the prior-art partitioned design that uses binary quantization. It retains competitive accuracy and pushes the Pareto front, exhibiting a +2.5% improvement when both methods employ a comparable number of effectual parameters. Moreover, our method enhances inference efficiency, achieving a 26% speedup, doubling energy efficiency, and reducing density by 2.8x for the same backbone.
  • Figure 3: Concept Diagram on the left: The diagram compares Binary, Ternary, and Signed Binary Quantization through a visual representation of their quantized weights. Qualitative Evaluation on the right: The table qualitatively evaluates them in terms of weight sparsity, weight repetition, and inference efficiency.
  • Figure 4: The PLUM framework leads to efficient inference because it acknowledges the repetition-sparsity trade-off through co-design: Visualizing inference when using recent systems (prabhakar2021summerge; fu2022q). Weight repetition enables binary models to skip work by re-using partial sums within and across filters. PLUM takes this even further by reducing the number of effectual operations by leveraging sparsity while retaining repetition. (details in Supp. \ref{sec:exploit_repetition_sparsity})
  • Figure 5: Comparison of PLUM and conventional binary methods on the CIFAR-10 and ImageNet datasets. PLUM pushes the Pareto frontier, providing superior accuracy with respect to effectual parameters and exhibiting a significant reduction in effectual parameters of equivalent models.
  • ...and 6 more figures
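
The Figure 4 caption describes two sources of skipped work: re-using partial sums when weights repeat, and dropping operations whose weight is zero. The sketch below illustrates both for a single dot product under simplifying assumptions. It is not the dataflow of the cited systems; the function name `grouped_dot`, the operation counts it reports, and the toy inputs are hypothetical.

```python
def grouped_dot(weights, activations):
    """Dot product that exploits weight repetition and sparsity.

    Activations sharing the same non-zero weight are summed first, so each
    distinct weight value is multiplied only once. Zero weights are skipped.
    Returns (result, accumulated, multiplications), where `accumulated` is
    the number of activations that took part in any sum.
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have equal length")

    partial_sums = {}  # distinct non-zero weight value -> sum of activations
    accumulated = 0
    for w, a in zip(weights, activations):
        if w == 0:
            continue  # ineffectual operation: contributes nothing
        partial_sums[w] = partial_sums.get(w, 0) + a
        accumulated += 1

    result = 0
    multiplications = 0
    for w, s in partial_sums.items():
        result += w * s  # one multiply per distinct weight value
        multiplications += 1
    return result, accumulated, multiplications


if __name__ == "__main__":
    activations = [2, 3, 5, 7]
    binary_filter = [1, -1, 1, 1]         # dense, two distinct values
    signed_binary_filter = [1, 0, 0, 1]   # sparse, one distinct non-zero value
    print(grouped_dot(binary_filter, activations))
    print(grouped_dot(signed_binary_filter, activations))
```

In this toy case the binary filter touches all four activations and needs two multiplies, while the signed-binary filter touches two activations and needs one, which mirrors the caption's point that sparsity reduces effectual operations while repetition is retained.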