Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers

Francesco Daghero, Daniele Jahier Pagliari, Francesco Conti, Luca Benini, Massimo Poncino, Alessio Burrello

TL;DR

This work targets efficient DNN inference on ultra-low-power MCUs by combining three strategies: optimized software kernels for semi-structured N:M sparsity, a lightweight ISA extension (xDecimate) that accelerates the decoding of non-zero (NZ) indices, and integration into the MATCH compiler to deploy these kernels in end-to-end networks. The approach reduces latency and memory footprint with minimal accuracy loss: end-to-end speedups reach $3.21\times$ for CNNs and $1.81\times$ for ViT at $1:16$ sparsity, with accuracy within $1.5\%$ of the dense baselines. Layer- and kernel-level results show gains for both convolutional and fully connected (FC) layers that grow with sparsity (in software alone, up to $2.1\times$ at $1:8$ and $3.4\times$ at $1:16$ over the dense kernels), and the hardware extension adds only about $5\%$ area overhead. Overall, the paper demonstrates practical sparse DNN execution on MCUs through software optimization, lightweight hardware support, and compiler-driven deployment, enabling energy-efficient edge AI at modest hardware cost.

Abstract

The acceleration of pruned Deep Neural Networks (DNNs) on edge devices such as Microcontrollers (MCUs) is a challenging task, given the tight area and power constraints of these devices. In this work, we propose a three-fold contribution to address this problem. First, we design a set of optimized software kernels for N:M pruned layers, targeting ultra-low-power, multicore RISC-V MCUs, which are up to 2.1x and 3.4x faster than their dense counterparts at 1:8 and 1:16 sparsity, respectively. Then, we implement a lightweight Instruction-Set Architecture (ISA) extension to accelerate the indirect load and non-zero index decompression operations required by our kernels, obtaining up to 1.9x extra speedup at the cost of a 5% area overhead. Lastly, we extend an open-source DNN compiler to utilize our sparse kernels for complete networks, showing speedups of 3.21x and 1.81x on a ResNet18 and a Vision Transformer (ViT), respectively, with less than 1.5% accuracy drop compared to a dense baseline.
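To make the "indirect load" and "non-zero index" terminology concrete, the following is a minimal Python sketch of a 1:M sparse dot product. It is an illustration under stated assumptions, not the paper's kernels: the function names (`prune_1_of_m`, `sparse_dot`, `densify`) are hypothetical, the largest-magnitude selection rule is assumed for the example, offsets are kept as plain integers rather than bit-packed, and the xDecimate instruction is not modeled.

```python
def prune_1_of_m(weights, m):
    """Keep one weight per group of m consecutive weights.

    Assumption for illustration: the survivor is the largest-magnitude
    weight of the group. Returns the kept values and their in-group offsets.
    """
    if len(weights) % m != 0:
        raise ValueError("weight count must be a multiple of m")
    values, offsets = [], []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        k = max(range(m), key=lambda i: abs(group[i]))  # offset of the survivor
        values.append(group[k])
        offsets.append(k)
    return values, offsets


def sparse_dot(values, offsets, activations, m):
    """Dot product that touches only the stored non-zero weights."""
    acc = 0
    for g, (w, k) in enumerate(zip(values, offsets)):
        # Indirect load: the activation address depends on the stored offset.
        acc += w * activations[g * m + k]
    return acc


def densify(values, offsets, m):
    """Rebuild the dense (mostly zero) weight vector, for checking only."""
    dense = [0] * (len(values) * m)
    for g, (w, k) in enumerate(zip(values, offsets)):
        dense[g * m + k] = w
    return dense


weights = [3, -1, 0, 2, 0, 0, -7, 1, 1, 0, 0, 0, 5, -2, 0, 4]
activations = list(range(1, 17))
values, offsets = prune_1_of_m(weights, 8)
print(values, offsets)                              # [-7, 5] [6, 4]
print(sparse_dot(values, offsets, activations, 8))  # 16
```

In this sketch a 1:8 pattern stores 2 weights and 2 offsets instead of 16 weights, and the inner loop performs one multiply-accumulate per group instead of eight. The price is the offset-dependent activation load, which is the kind of operation the abstract says the ISA extension accelerates.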

Paper Structure

This paper contains 24 sections, 2 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Pruning patterns and index compression at 75% sparsity.
  • Figure 2: Inner loop of the PULP-NN dense convolutional kernel.
  • Figure 3: Inner loop of our sparse convolutional kernel.
  • Figure 4: Innermost iteration of the dense matmul kernel (left), 1:8 / 1:16 sparse kernel with no custom instructions (center), and 1:8 / 1:16 sparse kernel with the xDecimate instruction (right).
  • Figure 5: Innermost iteration of the dense FC kernel (left), 1:8 SW-only sparse kernel (center), and 1:8 ISA-extended sparse kernel (right).
  • ...and 3 more figures