Combining Relevance and Magnitude for Resource-Aware DNN Pruning
Carla Fabiana Chiasserini, Francesco Malandrino, Nuria Molner, Zhiqiang Zhao
TL;DR
The paper tackles pruning DNNs for resource-constrained edge deployments, aiming to reduce bandwidth usage and latency without sacrificing accuracy. It introduces FlexRel, which blends training-time magnitude with inference-time relevance into a single pruning score, $s = (1-\delta) M_{norm} + \delta R_{norm}$, where $M_{norm}$ and $R_{norm}$ are the normalized magnitude and relevance and the tunable $\delta$ balances the two signals. Relevance is computed at inference time via a parameter-level extension of input-output relevance, expressed as $\mathsf{rel}(w_{kj}) = \sum_{i=1}^n\left[\mathsf{rel}(a_{ik}) + \mathsf{rel}(b_{ij})\right]$; convolutional layers are handled by converting them to fully connected representations. Experiments on VGG16/ImageNet show that FlexRel achieves higher pruning factors and substantial bandwidth savings than magnitude-only or relevance-only baselines, at modest overhead. The paper also gives guidance on selecting $\delta$.
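The two formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the min-max normalization, and the "drop the lowest-scoring fraction" rule are assumptions made for illustration, since this summary does not specify how $M_{norm}$ and $R_{norm}$ are normalized or how the score is thresholded. The inputs `rel_a` and `rel_b` stand for the already-computed terms $\mathsf{rel}(a_{ik})$ and $\mathsf{rel}(b_{ij})$; how those are obtained is outside the scope of the sketch.

```python
import numpy as np

def parameter_relevance(rel_a, rel_b):
    """rel(w_kj) = sum_i [rel(a_ik) + rel(b_ij)].

    rel_a has shape (n, K) and rel_b has shape (n, J); the result has
    shape (K, J). The inputs are assumed to be given.
    """
    rel_a = np.asarray(rel_a, dtype=float)
    rel_b = np.asarray(rel_b, dtype=float)
    # Sum over i first, then broadcast the k and j axes against each other.
    return rel_a.sum(axis=0)[:, None] + rel_b.sum(axis=0)[None, :]

def minmax_normalize(x):
    """Scale values to [0, 1]; an assumed normalization, not from the paper."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)
    return (x - x.min()) / span

def flexrel_scores(weights, relevance, delta):
    """s = (1 - delta) * M_norm + delta * R_norm."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    m_norm = minmax_normalize(np.abs(weights))  # magnitude signal
    r_norm = minmax_normalize(relevance)        # relevance signal
    return (1.0 - delta) * m_norm + delta * r_norm

def prune_mask(scores, prune_fraction):
    """Boolean mask that is False for the lowest-scoring fraction of parameters."""
    scores = np.asarray(scores, dtype=float)
    k = int(prune_fraction * scores.size)
    mask = np.ones(scores.size, dtype=bool)
    if k > 0:
        lowest = np.argsort(scores, axis=None, kind="stable")[:k]
        mask[lowest] = False
    return mask.reshape(scores.shape)

# Toy example: a 2x2 weight matrix with made-up relevance values.
weights = np.array([[0.9, -0.1], [0.05, -0.6]])
relevance = np.array([[0.2, 0.8], [0.1, 0.4]])
for delta in (0.0, 0.5, 1.0):
    mask = prune_mask(flexrel_scores(weights, relevance, delta), 0.5)
    print(delta, mask.tolist())
```

With $\delta = 0$ the score reduces to magnitude-only pruning and with $\delta = 1$ to relevance-only pruning, so in this toy example the two extremes keep different parameters; intermediate values of $\delta$ interpolate between them.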
Abstract
Pruning neural networks, i.e., removing some of their parameters whilst retaining their accuracy, is one of the main ways to reduce the latency of a machine learning pipeline, especially in resource- and/or bandwidth-constrained scenarios. In this context, the pruning technique, i.e., how to choose the parameters to remove, is critical to system performance. In this paper, we propose a novel pruning approach, called FlexRel, which combines training-time and inference-time information, namely parameter magnitude and relevance, to improve the resulting accuracy whilst saving both computational resources and bandwidth. Our performance evaluation shows that FlexRel achieves higher pruning factors, saving over 35% of bandwidth for typical accuracy targets.
