Decoupled Access-Execute enabled DVFS for tinyML deployments on STM32 microcontrollers

Elisavet Lydia Alvanaki; Manolis Katsaragakis; Dimosthenis Masouros; Sotirios Xydis; Dimitrios Soudris

TL;DR

The paper tackles energy-efficient CNN inference on resource-constrained STM32 MCUs under latency constraints. It introduces a DVFS-enabled Decoupled Access Execute (DAE) framework that splits memory-bound and compute-bound work, with per-layer co-exploration of decoupling granularity and clocking. The allocation is formulated as an NP-complete optimization, cast as a Multiple-Choice Knapsack Problem and solved by dynamic programming to meet QoS. Empirical results on three TinyCNNs show energy reductions up to 25.2% versus TinyEngine baselines, validating practical benefits for tinyML on the edge.
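The knapsack formulation mentioned above can be illustrated with a minimal sketch. It assumes each layer offers a small set of candidate configurations, each with a latency (discretized to non-negative integer time units) and an energy cost, and that exactly one configuration must be picked per layer so that the total latency stays within a budget. The function name `solve_mckp`, the data layout, and the numbers are hypothetical and are not taken from the paper; the paper's actual formulation may differ in its details.

```python
from math import inf

def solve_mckp(layers, budget):
    """Pick one (latency, energy) option per layer, minimizing total energy
    subject to total latency <= budget. Latencies are non-negative integers.
    Returns (min_energy, choices), or (inf, None) if no selection fits."""
    # best[t] = minimum energy over the layers seen so far with total latency exactly t
    best = [inf] * (budget + 1)
    best[0] = 0.0
    picks = []  # picks[i][t] = option index chosen for layer i when reaching latency t
    for options in layers:
        nxt = [inf] * (budget + 1)
        pick = [-1] * (budget + 1)
        for t in range(budget + 1):
            if best[t] == inf:
                continue  # latency t is not reachable with the previous layers
            for j, (lat, en) in enumerate(options):
                t2 = t + lat
                if t2 <= budget and best[t] + en < nxt[t2]:
                    nxt[t2] = best[t] + en
                    pick[t2] = j
        best = nxt
        picks.append(pick)
    # Cheapest reachable total latency within the budget
    t_end = min(range(budget + 1), key=lambda t: best[t])
    if best[t_end] == inf:
        return inf, None
    # Backtrack from the last layer to recover the chosen option per layer
    choices = []
    t = t_end
    for i in range(len(layers) - 1, -1, -1):
        j = picks[i][t]
        choices.append(j)
        t -= layers[i][j][0]
    choices.reverse()
    return best[t_end], choices

# Hypothetical example: two layers, two configurations each (latency, energy)
example_layers = [[(4, 10.0), (6, 6.0)], [(3, 8.0), (5, 5.0)]]
energy, chosen = solve_mckp(example_layers, budget=9)
```

With a budget of 9 time units, the sketch selects the slower, cheaper option for the first layer and the faster option for the second; relaxing the budget to 11 lets both layers take their cheapest option. The table has one row per layer and one column per latency unit, so the run time grows with the number of layers, the budget, and the number of options per layer.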

Abstract

Over the last years, the number of Machine Learning (ML) inference applications deployed on the Edge has been rapidly increasing. Recent Internet of Things (IoT) devices and microcontrollers (MCUs) are becoming more and more mainstream in everyday activities. In this work we focus on the family of STM32 MCUs. We propose a novel methodology for CNN deployment on the STM32 family, focusing on power optimization through effective clocking exploration and configuration and decoupled access-execute convolution kernel execution. Our approach is enhanced with optimization of the power consumption through Dynamic Voltage and Frequency Scaling (DVFS) under various latency constraints, composing an NP-complete optimization problem. We compare our approach against the state-of-the-art TinyEngine inference engine, as well as TinyEngine coupled with power-saving modes of the STM32 MCUs, showing that we can achieve up to 25.2% less energy consumption for varying QoS levels.
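To make the clocking exploration more concrete, the sketch below computes the system clock from the external oscillator (HSE) and the main PLL dividers and multipliers. It assumes the main-PLL relation used on STM32F4/F7-class parts, SYSCLK = HSE / PLLM × PLLN / PLLP, with the PLL selected as the SYSCLK source; the exact device and valid parameter ranges are not stated in this summary, so the candidate values are illustrative only and the helper names are hypothetical.

```python
from itertools import product

def sysclk_hz(hse_hz, pllm, plln, pllp=2):
    """SYSCLK when the main PLL is the clock source (assumed F4/F7-style PLL)."""
    vco_in = hse_hz / pllm    # PLL input after the PLLM divider
    vco_out = vco_in * plln   # VCO output after the PLLN multiplier
    return vco_out / pllp     # SYSCLK after the PLLP divider

def configs_for(target_hz, hse_hz, pllm_values, plln_values, pllp=2):
    """List (PLLM, PLLN) pairs that yield exactly target_hz.
    Uses integer arithmetic to avoid floating-point comparison issues.
    Does not check device-specific limits on the VCO input/output range."""
    return [
        (m, n)
        for m, n in product(pllm_values, plln_values)
        if hse_hz * n == target_hz * m * pllp
    ]

# Illustrative values: an 8 MHz HSE and a 216 MHz target
same_freq = configs_for(216_000_000, 8_000_000, [4, 8], [216, 432])
```

Several parameter combinations can reach the same SYSCLK frequency, which is why the PLL parameters themselves become part of the search space rather than the target frequency alone.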

Paper Structure (9 sections, 1 equation, 6 figures)

Introduction
Clocking Scheme of STM32 Microcontrollers
Considerations of SYSCLK Frequency Scaling
Decoupled Access Execute enabled DVFS on STM32 MCUs
Step 1: Memory Access & CPU Execution Decoupling
Step 2: DAE and Clocking Co-exploration
Step 3: QoS-aware Energy Optimization
Experimental Setup and Evaluation
Conclusion

Figures (6)

Figure 1: Simplified circuit diagram for clock configuration through HSE and PLL parameters.
Figure 2: Clock Frequency and Power for different HSE, PLLM and PLLN configurations.
Figure 3: Overview of the proposed methodology.
Figure 4: Impact of different DAE and clocking configurations on latency and power of depthwise and pointwise layers.
Figure 5: Energy consumption gains of our approach over the TinyEngine baseline [lin2020mcunet]. We compare against TinyEngine with Clock Gating over the examined CNN models.
...and 1 more figures
