Accelerating TinyML Inference on Microcontrollers through Approximate Kernels
Giorgos Armeniakos, Georgios Mentzos, Dimitrios Soudris
TL;DR
The paper addresses the latency and memory constraints of CNN inference on MCU-based TinyML systems. It proposes a cooperative framework that combines layer-based kernel unpacking, compile-time fixed-weight optimization, and offline significance-aware computation skipping, with a design space exploration that identifies Pareto-optimal latency-accuracy trade-offs. The framework reduces flash usage and runtime overhead, enabling faster inference with minimal or zero accuracy loss on CIFAR-10 CNNs evaluated on an STM32 MCU, and it outperforms or matches state-of-the-art libraries and compilers in several scenarios. This work demonstrates that approximate computing, when tightly integrated with MCU kernels, can extend the model complexity that is feasible on constrained devices and enable real-time TinyML deployments.
Abstract
The rapid growth of microcontroller-based IoT devices has opened up numerous applications, from smart manufacturing to personalized healthcare. Despite the widespread adoption of energy-efficient microcontroller units (MCUs) in the Tiny Machine Learning (TinyML) domain, they still face significant limitations in terms of performance and memory (RAM, Flash). In this work, we combine approximate computing and software kernel design to accelerate the inference of approximate CNN models on MCUs. Our kernel-based approximation framework first unpacks the operands of each convolution layer and then conducts an offline calculation to determine the significance of each operand. Subsequently, through a design space exploration, it employs a computation-skipping approximation strategy based on the calculated significance. Our evaluation on an STM32-Nucleo board and two popular CNNs trained on the CIFAR-10 dataset shows that, compared to state-of-the-art exact inference, our Pareto-optimal solutions can achieve on average a 21% latency reduction with no degradation in Top-1 classification accuracy, while for lower accuracy requirements the reduction becomes even more pronounced.
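
The flow summarized in the abstract (unpack operands, score their significance offline, skip the least significant computations, then explore the resulting trade-offs) can be illustrated with a minimal host-side sketch. This is not the authors' implementation. The significance proxy (absolute weight times mean absolute calibration activation), the error metric (mean absolute output error rather than Top-1 accuracy), and every function name and data value are assumptions made for illustration. The example uses a single flattened kernel instead of a full convolution layer on an MCU.

```python
def significance(weights, calib_inputs):
    """Assumed proxy: |w_i| times the mean |x_i| over the calibration inputs."""
    n = len(calib_inputs)
    mean_abs = [sum(abs(x[i]) for x in calib_inputs) / n
                for i in range(len(weights))]
    return [abs(w) * m for w, m in zip(weights, mean_abs)]


def build_skip_mask(sig, threshold):
    """Decide offline which multiply-accumulates (MACs) to keep."""
    return [s >= threshold for s in sig]


def exact_dot(weights, x):
    """Reference dot product with every MAC executed."""
    return sum(w * xi for w, xi in zip(weights, x))


def approx_dot(weights, x, keep):
    """Dot product that omits the MACs masked out offline."""
    return sum(w * xi for w, xi, k in zip(weights, x, keep) if k)


def explore(weights, calib_inputs, thresholds):
    """Sweep thresholds; return (threshold, MACs kept, mean absolute error)."""
    sig = significance(weights, calib_inputs)
    points = []
    for t in thresholds:
        keep = build_skip_mask(sig, t)
        err = sum(abs(exact_dot(weights, x) - approx_dot(weights, x, keep))
                  for x in calib_inputs) / len(calib_inputs)
        points.append((t, sum(keep), err))
    return points


def pareto_front(points):
    """Keep points not dominated in (MACs kept, error); lower is better."""
    front = []
    for p in points:
        dominated = any(
            q[1] <= p[1] and q[2] <= p[2] and (q[1] < p[1] or q[2] < p[2])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical int8-style weights and two calibration input vectors.
weights = [64, -1, 32, 2]
calib_inputs = [[10, 20, 30, 40], [20, 20, 10, 0]]

for point in pareto_front(explore(weights, calib_inputs, [0, 30, 100, 700])):
    print(point)
```

Because the weights are fixed at compile time and the operands are unpacked, a real kernel would not need a runtime mask: the skipped MACs could simply be left out of the generated code, which is where the latency and flash savings would come from. With per-layer thresholds the search space becomes multi-dimensional, and a filter such as `pareto_front` would discard configurations that are both slower and less accurate than another candidate.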
