Decoupled Access-Execute enabled DVFS for tinyML deployments on STM32 microcontrollers

Elisavet Lydia Alvanaki, Manolis Katsaragakis, Dimosthenis Masouros, Sotirios Xydis, Dimitrios Soudris

TL;DR

The paper tackles energy-efficient CNN inference on resource-constrained STM32 MCUs under latency constraints. It introduces a DVFS-enabled Decoupled Access-Execute (DAE) framework that separates memory-bound from compute-bound work and co-explores, per layer, the decoupling granularity and the clock configuration. Choosing one configuration per layer is formulated as an NP-complete optimization problem, cast as a Multiple-Choice Knapsack Problem, and solved by dynamic programming so that the QoS (latency) target is met. Empirical results on three TinyCNNs show energy reductions of up to 25.2% versus TinyEngine baselines, validating practical benefits for tinyML on the edge.

Abstract

In recent years, Machine Learning (ML) inference applications deployed on the Edge have grown rapidly. Internet of Things (IoT) devices and microcontrollers (MCUs) are becoming increasingly mainstream in everyday activities. In this work we focus on the family of STM32 MCUs. We propose a novel methodology for CNN deployment on the STM32 family, focusing on power optimization through effective clocking exploration and configuration and decoupled access-execute convolution kernel execution. Our approach further optimizes power consumption through Dynamic Voltage and Frequency Scaling (DVFS) under various latency constraints, which composes an NP-complete optimization problem. We compare our approach against the state-of-the-art TinyEngine inference engine, as well as TinyEngine coupled with the power-saving modes of the STM32 MCUs, and show that we can achieve up to 25.2% less energy consumption for varying QoS levels.
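
The TL;DR describes the per-layer configuration choice as a Multiple-Choice Knapsack Problem solved by dynamic programming. The following is a minimal sketch of that general technique, not the authors' implementation: it assumes each layer offers a few candidate configurations, each summarized by an integer latency (in some fixed time unit) and an energy cost, and it picks exactly one option per layer so that total energy is minimal under a latency budget. The function name `solve_mckp`, the data layout, and the numbers are hypothetical.

```python
from math import inf


def solve_mckp(layers, budget):
    """Pick one (latency, energy) option per layer, minimizing total energy.

    layers: list of layers; each layer is a list of (latency, energy) options,
            with latency given as a non-negative integer (assumed discretized).
    budget: integer latency budget for the whole network.
    Returns (total_energy, chosen_option_indices), or None if infeasible.
    """
    # best[t] = minimum energy of the layers processed so far with latency <= t
    best = [0.0] * (budget + 1)
    picks = []  # picks[i][t] = option index chosen for layer i at budget t

    for options in layers:
        new_best = [inf] * (budget + 1)
        pick = [-1] * (budget + 1)
        for t in range(budget + 1):
            for k, (latency, energy) in enumerate(options):
                if latency <= t and best[t - latency] + energy < new_best[t]:
                    new_best[t] = best[t - latency] + energy
                    pick[t] = k
        best = new_best
        picks.append(pick)

    if best[budget] == inf:
        return None  # no combination of options meets the latency budget

    # Walk backwards through the layers to recover the chosen options.
    chosen = []
    t = budget
    for i in range(len(layers) - 1, -1, -1):
        k = picks[i][t]
        chosen.append(k)
        t -= layers[i][k][0]
    chosen.reverse()
    return best[budget], chosen


# Hypothetical two-layer network: option 0 is fast but costly, option 1 is slow but cheap.
example_layers = [
    [(2, 10.0), (4, 6.0)],
    [(3, 8.0), (5, 5.0)],
]
result = solve_mckp(example_layers, budget=7)
```

The table has one entry per (layer, latency unit) pair, so the running time grows with the number of layers, the number of options per layer, and the budget resolution. This pseudo-polynomial cost is what makes the dynamic-programming approach practical despite the problem being NP-complete.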

Paper Structure

This paper contains 9 sections, 1 equation, and 6 figures.

Figures (6)

  • Figure 1: Simplified circuit diagram for clock configuration through HSE and PLL parameters.
  • Figure 2: Clock Frequency and Power for different HSE, PLLM and PLLN configurations.
  • Figure 3: Overview of the proposed methodology.
  • Figure 4: Impact of different DAE and clocking configurations on latency and power of depthwise and pointwise layers.
  • Figure 5: Energy consumption gains of our approach over the TinyEngine (Lin et al., 2020) baseline. We compare against TinyEngine with Clock Gating over the examined CNN models.
  • ...and 1 more figures
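
Figures 1 and 2 refer to clock configuration through the HSE and the PLLM and PLLN parameters. As a rough illustration of how those parameters relate to the resulting frequency, the sketch below assumes an STM32F4/F7-style main PLL, where the HSE input is divided by PLLM, multiplied by PLLN, and divided by an output divider PLLP. The exact MCU and its PLL layout are not stated in this summary, so the PLLP divider, the function name `sysclk_hz`, and the example values are assumptions.

```python
def sysclk_hz(hse_hz, pllm, plln, pllp=2):
    """System clock from the main PLL, assuming an F4/F7-style PLL layout.

    hse_hz: external oscillator (HSE) frequency in Hz
    pllm:   input divider
    plln:   VCO multiplier
    pllp:   output divider (assumed to be one of 2, 4, 6, 8)
    """
    if pllp not in (2, 4, 6, 8):
        raise ValueError("PLLP is assumed to be 2, 4, 6 or 8")
    vco_in = hse_hz / pllm    # frequency fed into the VCO
    vco_out = vco_in * plln   # VCO output frequency
    return vco_out / pllp     # frequency delivered as system clock


# Two hypothetical configurations: the same HSE, different PLLN values.
fast = sysclk_hz(25_000_000, pllm=25, plln=432)
slow = sysclk_hz(25_000_000, pllm=25, plln=216)
```

Several (HSE, PLLM, PLLN) combinations can produce the same output frequency, which is consistent with Figure 2 reporting both frequency and power per configuration.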