Combining Relevance and Magnitude for Resource-Aware DNN Pruning

Carla Fabiana Chiasserini; Francesco Malandrino; Nuria Molner; Zhiqiang Zhao

TL;DR

The paper tackles pruning of DNNs for resource-constrained edge deployments, aiming to reduce bandwidth and latency without sacrificing accuracy. It introduces FlexRel, which blends training-time magnitude and inference-time relevance into a single pruning score, $s = (1-\delta) M_{norm} + \delta R_{norm}$, where the tunable weight $\delta$ balances the two signals. Relevance is computed at inference time through a parameter-level extension of input-output relevance, $\mathsf{rel}(w_{kj}) = \sum_{i=1}^n\left[\mathsf{rel}(a_{ik}) + \mathsf{rel}(b_{ij})\right]$; convolutional layers are handled by converting them to fully connected representations. Experiments on VGG16/ImageNet show that FlexRel achieves higher pruning factors and substantial bandwidth savings than magnitude-only or relevance-only baselines, at modest overhead, and the paper offers guidance on selecting $\delta$.
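
To make the two formulas above concrete, here is a minimal Python sketch, not the authors' implementation. Several details are assumptions that the summary does not specify: min-max scaling is used for $M_{norm}$ and $R_{norm}$; $a_{ik}$ and $b_{ij}$ are read as the input and output values of a fully connected layer for sample $i$; and the pruning factor is treated as the fraction of parameters removed. All function names are hypothetical.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]. Assumed normalization, not taken from the paper."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def parameter_relevance(rel_a: List[List[float]],
                        rel_b: List[List[float]]) -> List[List[float]]:
    """Compute rel(w_kj) = sum_i [rel(a_ik) + rel(b_ij)].

    rel_a[i][k]: relevance of input k for sample i (assumed meaning of a_ik).
    rel_b[i][j]: relevance of output j for sample i (assumed meaning of b_ij).
    """
    n = len(rel_a)
    num_in, num_out = len(rel_a[0]), len(rel_b[0])
    # The sum over samples separates into one term per input and one per output.
    sum_a = [sum(rel_a[i][k] for i in range(n)) for k in range(num_in)]
    sum_b = [sum(rel_b[i][j] for i in range(n)) for j in range(num_out)]
    return [[sum_a[k] + sum_b[j] for j in range(num_out)] for k in range(num_in)]


def flexrel_scores(magnitudes: List[float], relevances: List[float],
                   delta: float) -> List[float]:
    """Combined score s = (1 - delta) * M_norm + delta * R_norm per parameter."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    m_norm = min_max_normalize(magnitudes)
    r_norm = min_max_normalize(relevances)
    return [(1.0 - delta) * m + delta * r for m, r in zip(m_norm, r_norm)]


def keep_mask(scores: List[float], pruning_fraction: float) -> List[bool]:
    """Return False for the lowest-scoring fraction of parameters (those pruned)."""
    n_prune = int(len(scores) * pruning_fraction)
    order = sorted(range(len(scores)), key=lambda idx: scores[idx])
    pruned = set(order[:n_prune])
    return [idx not in pruned for idx in range(len(scores))]


# Toy example with made-up numbers, for illustration only.
weights = [0.9, -0.1, 0.5, -0.05]
magnitudes = [abs(w) for w in weights]
relevances = [0.2, 0.8, 0.1, 0.9]
mask = keep_mask(flexrel_scores(magnitudes, relevances, delta=0.5), 0.5)
```

By construction, $\delta = 0$ reduces the score to pure magnitude-based pruning and $\delta = 1$ to pure relevance-based pruning; intermediate values trade the two off.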

Abstract

Pruning neural networks, i.e., removing some of their parameters whilst retaining their accuracy, is one of the main ways to reduce the latency of a machine learning pipeline, especially in resource- and/or bandwidth-constrained scenarios. In this context, the pruning technique, i.e., how to choose the parameters to remove, is critical to the system performance. In this paper, we propose a novel pruning approach, called FlexRel and predicated upon combining training-time and inference-time information, namely, parameter magnitude and relevance, in order to improve the resulting accuracy whilst saving both computational resources and bandwidth. Our performance evaluation shows that FlexRel is able to achieve higher pruning factors, saving over 35% bandwidth for typical accuracy targets.

Paper Structure (8 sections, 1 equation, 4 figures, 1 table)

Introduction
Current Pruning Approaches
The FlexRel Approach
Inference-time importance of DNN parameters: Relevance scores
Using Relevance Scores
Numerical Results
Discussion and Open Issues
Conclusion

Figures (4)

Figure 1: Three ways to make pruning decisions: the traditional way (on the left-hand side), which directly considers the magnitude of the DNN parameters and prunes those with the smallest magnitude; relevance-based pruning (on the right-hand side), which uses both the DNN parameters and input samples, relevance being a quantity computed during inference; and our FlexRel approach (in the middle), which combines magnitude and relevance to make more effective pruning decisions.
Figure 2: Accuracy reached by the VGG16 DNN when trained over the ImageNet dataset as a function of the pruning factor, for different pruning techniques.
Figure 3: Elapsed learning time as a function of the accuracy target, for the magnitude-based (a) and FlexRel (b) techniques. Numbers in the plot represent the quantity of transmitted data.
Figure 4: Effect of the weighting factor $\delta$ on the achieved accuracy, for different pruning factors.
