PLUM: Improving Inference Efficiency By Leveraging Repetition-Sparsity Trade-Off

Sachit Kuhar; Yash Jain; Alexey Tumanov

TL;DR

This work tackles the inefficiency of DNN inference on edge devices due to data movement and limited compute by introducing the repetition-sparsity trade-off. It proposes PLUM, a co-design framework that combines two signed-binary quantization functions with region-based local binarization to exploit this trade-off across DNN blocks, guided by backpropagation with an adapted gradient mechanism. Empirically, PLUM pushes the Pareto frontier on CIFAR-10 and ImageNet, achieving a 26% hardware speedup, roughly 2x energy efficiency, and a 2.8x reduction in density compared with binary baselines, while preserving top-1 accuracy (e.g., 66.2% on ImageNet for ResNet). The approach yields a practical, scalable path toward deploying accurate, efficient models in resource-constrained environments and provides insights into the latent-distribution dynamics under signed-binary co-design.

Abstract

Efficient inference of Deep Neural Networks (DNNs) on resource-constrained edge devices is essential. Quantization and sparsity are key techniques that translate to repetition and sparsity within tensors at the hardware-software interface. This paper introduces the concept of repetition-sparsity trade-off that helps explain computational efficiency during inference. We propose PLUM, a unified co-design framework that integrates DNN inference systems and quantization (forward and backward pass) to leverage the repetition-sparsity trade-off to improve inference efficiency. Our results demonstrate that PLUM's quantization method is more accurate than binary quantization with the same number of non-zero weights. Detailed analysis indicates that signed binarization generates a smaller distribution of effectual (non-zero) parameters nested within a larger distribution of total parameters of latent full-precision weights for a DNN block. Finally, the proposed PLUM framework achieves a 26% speedup on real hardware, doubles energy efficiency, and reduces density by 2.8x compared to binary methods while retaining top-1 accuracy when compared to prior-art methods for ResNets on ImageNet (by achieving 66.2% top-1 accuracy), presenting an alternative solution for deploying efficient models in resource-limited environments.
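
To make the repetition-sparsity trade-off concrete, the following is a minimal sketch, not the authors' implementation. It quantizes a few randomly generated filters with binary, ternary, and signed-binary functions, then reports density (the fraction of non-zero weights) and the number of distinct values per filter, a rough proxy for weight repetition. The threshold rule `delta`, the alternating per-filter sign assignment, and all function names are assumptions made for illustration only.

```python
import random


def binary(filt):
    # Binary: every weight becomes +1 or -1, so there are no zeros.
    return [1 if w >= 0 else -1 for w in filt]


def ternary(filt, delta):
    # Ternary: weights within [-delta, delta] become 0.
    return [1 if w > delta else -1 if w < -delta else 0 for w in filt]


def signed_binary(filt, delta, sign):
    # Signed binary (assumed form): one region, here a whole filter,
    # uses either {0, +1} or {0, -1}, never both signs.
    if sign > 0:
        return [1 if w > delta else 0 for w in filt]
    return [-1 if w < -delta else 0 for w in filt]


def density(filters):
    # Fraction of effectual (non-zero) weights across all filters.
    total = sum(len(f) for f in filters)
    nonzero = sum(1 for f in filters for w in f if w != 0)
    return nonzero / total


def max_unique_per_filter(filters):
    # Fewer distinct values per filter means more weight repetition.
    return max(len(set(f)) for f in filters)


# Hypothetical latent full-precision weights: 8 filters of 64 weights.
rng = random.Random(0)
latent = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(8)]

# Assumed threshold rule; the actual rule is not given in this text.
delta = 0.05 * max(abs(w) for f in latent for w in f)

q_bin = [binary(f) for f in latent]
q_ter = [ternary(f, delta) for f in latent]
# Assumed sign assignment: alternate positive and negative filters.
q_sb = [
    signed_binary(f, delta, 1 if i % 2 == 0 else -1)
    for i, f in enumerate(latent)
]

for name, q in [("binary", q_bin), ("ternary", q_ter), ("signed-binary", q_sb)]:
    print(f"{name:14s} density={density(q):.2f} "
          f"max unique values per filter={max_unique_per_filter(q)}")
```

Under these assumptions, binary keeps two values per filter but is fully dense, ternary gains zeros at the cost of a third value, and signed binary keeps at most two values per filter while still introducing zeros.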

Paper Structure (46 sections, 4 equations, 11 figures, 13 tables)

Introduction
Background
Quantization
Weight Sparsity
Weight Repetition
Repetition, Sparsity & Inference Latency
PLUM
Leveraging Repetition-Sparsity Trade-off
Co-design Exploration
DNN inference in PLUM
Quantization in PLUM: Signed Binary
Intra-Filter Signed-Binary Quant
Inter-Filter Signed-Binary Quant
Value Assignment of Signed-Binary Quant Functions
Backpropagation in PLUM
...and 31 more sections

Figures (11)

Figure 1: On the left: The conventional, isolated approach, in which DNN inference systems and quantization methods are designed separately and therefore ignore the repetition-sparsity trade-off, leading to inefficient inference. On the right: PLUM, a unified design framework, performs quantization-system co-design to exploit the repetition-sparsity trade-off, thereby enhancing computational efficiency.
Figure 2: PLUM vs. Prior-Art: For ResNet on ImageNet, the pronounced spread indicates that PLUM holistically outperforms the prior-art partitioned design based on binary quantization. It retains competitive accuracy and pushes the Pareto front, exhibiting a +2.5% improvement when both methods employ a comparable number of effectual parameters. Moreover, our method enhances inference efficiency, achieving a 26% speedup, doubling energy efficiency, and reducing density by 2.8x for the same backbone.
Figure 3: Concept Diagram on the left: The diagram shows the comparison of Binary, Ternary, and Signed Binary Quantization in terms of visual representation of their quantized weights. Qualitative Evaluation on the right: The table qualitatively evaluates them in terms of weight sparsity, weight repetition, and inference efficiency.
Figure 4: The PLUM framework leads to efficient inference because it acknowledges the repetition-sparsity trade-off through co-design: Visualizing inference when using recent systems (prabhakar2021summerge; fu2022q). Weight repetition enables binary models to skip work by re-using partial sums within and across filters. PLUM takes this even further by reducing the number of effectual operations by leveraging sparsity while retaining repetition. (Details in the supplementary section on exploiting repetition and sparsity.)
Figure 5: Comparison of PLUM and conventional binary methods on CIFAR10 and ImageNet datasets. PLUM pushes the Pareto frontier, providing superior accuracy with respect to effectual parameters and exhibiting a significant reduction in effectual parameters of equivalent models.
...and 6 more figures
