Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers
Francesco Daghero, Daniele Jahier Pagliari, Francesco Conti, Luca Benini, Massimo Poncino, Alessio Burrello
TL;DR
This work tackles efficient DNN inference on ultra-low-power MCUs by combining three strategies: optimized software kernels for semi-structured N:M sparsity, a lightweight ISA extension (xDecimate) that accelerates the decoding of non-zero (NZ) weight indices, and integration into the MATCH compiler to deploy these kernels in end-to-end networks. The approach reduces latency and memory footprint with minimal accuracy loss, reporting end-to-end speedups of up to $3.21\times$ for CNNs and $1.81\times$ for ViT at $1:16$ sparsity, while keeping accuracy within $1.5\%$ of the dense baselines. Layer- and kernel-level results show gains for both convolutional and fully connected (FC) layers that grow with sparsity, and the hardware extension adds only about $5\%$ area overhead. Overall, the paper demonstrates practical sparse DNN execution on MCUs through software optimization, lightweight hardware support, and compiler-driven deployment, at modest hardware cost.
Abstract
The acceleration of pruned Deep Neural Networks (DNNs) on edge devices such as Microcontrollers (MCUs) is a challenging task, given the tight area and power constraints of these devices. In this work, we propose a three-fold contribution to address this problem. First, we design a set of optimized software kernels for N:M pruned layers, targeting ultra-low-power, multicore RISC-V MCUs, which are up to 2.1x and 3.4x faster than their dense counterparts at 1:8 and 1:16 sparsity, respectively. Then, we implement a lightweight Instruction-Set Architecture (ISA) extension to accelerate the indirect loads and the decompression of non-zero indices required by our kernels, obtaining up to 1.9x extra speedup at the cost of a 5% area overhead. Lastly, we extend an open-source DNN compiler to use our sparse kernels for complete networks, showing speedups of 3.21x and 1.81x on a ResNet18 and a Vision Transformer (ViT), respectively, with less than a 1.5% accuracy drop compared to a dense baseline.
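To make the N:M format and the "indirect load" concrete, the following is a minimal reference sketch in Python, not the paper's kernels. It assumes that N:M pruning keeps the N largest-magnitude weights in every block of M consecutive weights, and that each kept weight is stored together with its in-block offset; the function names `pack_n_m` and `sparse_dot` are hypothetical. The actual storage layout, bit packing, and selection criterion used by the authors are not described in this summary.

```python
from typing import List, Tuple


def pack_n_m(weights: List[int], n: int, m: int) -> Tuple[List[int], List[int]]:
    """Compress a weight row into N:M form (illustrative assumption).

    For every block of m consecutive weights, keep the n entries with the
    largest magnitude. Returns the kept values and their in-block offsets
    (each offset is in the range 0..m-1).
    """
    if len(weights) % m != 0:
        raise ValueError("row length must be a multiple of m")
    values: List[int] = []
    offsets: List[int] = []
    for start in range(0, len(weights), m):
        block = weights[start:start + m]
        # Rank positions by magnitude, keep the top n, then restore
        # ascending position order so activations are read left to right.
        ranked = sorted(range(m), key=lambda i: -abs(block[i]))
        for i in sorted(ranked[:n]):
            values.append(block[i])
            offsets.append(i)
    return values, offsets


def sparse_dot(values: List[int], offsets: List[int],
               activations: List[int], n: int, m: int) -> int:
    """Dot product that touches only the activations paired with kept weights."""
    acc = 0
    for k, (w, off) in enumerate(zip(values, offsets)):
        block = k // n  # index of the M-wide block this weight belongs to
        # Indirect load: the activation address depends on the decoded offset.
        acc += w * activations[block * m + off]
    return acc


# Usage: a 1:4 example with two blocks.
row = [0, 5, 0, 0, 0, 0, -3, 0]
vals, offs = pack_n_m(row, n=1, m=4)
result = sparse_dot(vals, offs, [1, 2, 3, 4, 5, 6, 7, 8], n=1, m=4)
```

In this sketch each offset needs $\lceil \log_2 M \rceil$ bits (3 bits for 1:8, 4 bits for 1:16), so on real hardware the offsets would be packed several per word. Unpacking them and computing the activation address for every multiply-accumulate is the decode-plus-indirect-load work that the abstract says the ISA extension accelerates.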
