Optimizing the Deployment of Tiny Transformers on Low-Power MCUs

Victor J. B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini

TL;DR

The paper tackles the deployment of encoder Tiny Transformer models on extreme-edge MCUs by presenting an end-to-end deployment framework that optimizes the attention computation. It introduces Fused-Weight Self-Attention, which fuses the Q and K projection weights offline, and a Depth-First Tiling strategy that limits the memory peak of the attention map, paired with a tailored kernel library for single- and multi-core MCUs. Quantization via QuantLib and platform-aware deployment (DumpO for ARM, DORY for GAP9) enable cross-platform execution of three Tiny Transformer tasks (hand-gesture, EEG seizure, ECG arrhythmia) with lower latency and energy consumption than state-of-the-art libraries. The framework achieves on average 4.79x and 2.0x lower latency than CMSIS-NN (ARM) and PULP-NN (RISC-V), respectively, and reduces the memory peak by up to 6.19x, highlighting a practical path to real-time Tiny Transformers at the edge and offering open-source tooling for broader impact.

Abstract

Transformer networks are rapidly becoming SotA in many fields, such as NLP and CV. Similarly to CNNs, there is a strong push for deploying Transformer models at the extreme edge, ultimately fitting the tiny power budget and memory footprint of MCUs. However, the early approaches in this direction are mostly ad hoc, platform-specific, and model-specific. This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs. We propose a complete framework to perform end-to-end deployment of Transformer models onto single and multi-core MCUs. Our framework provides an optimized library of kernels to maximize data reuse and avoid unnecessary data marshaling operations into the crucial attention block. A novel MHSA inference schedule, named Fused-Weight Self-Attention, is introduced, fusing the linear projection weights offline to further reduce the number of operations and parameters. Furthermore, to mitigate the memory peak reached by the computation of the attention map, we present a Depth-First Tiling scheme for MHSA. We evaluate our framework on three different MCU classes exploiting ARM and RISC-V ISAs, namely the STM32H7, the STM32L4, and GAP9 (RV32IMC-XpulpV2). We reach an average of 4.79x and 2.0x lower latency compared to SotA libraries CMSIS-NN (ARM) and PULP-NN (RISC-V), respectively. Moreover, we show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention can reduce the runtime by 1.53x and the number of parameters by 25%. We report significant improvements across several Tiny Transformers: for instance, when executing a transformer block for the task of radar-based hand-gesture recognition on GAP9, we achieve a latency of 0.14 ms and an energy consumption of 4.92 µJ, 2.32x lower than the SotA PULP-NN library on the same platform.
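The idea of fusing projection weights offline can be illustrated with a minimal NumPy sketch. It relies on the identity $QK^T = (XW_Q)(XW_K)^T = X\,(W_Q W_K^T)\,X^T$, so the product $W_Q W_K^T$ can be precomputed once per head. The sketch is an illustration under stated assumptions, not the paper's kernel: it uses floating point rather than quantized arithmetic, omits biases and the softmax scaling, and all names (`fuse_qk_weights`, `fused_scores`, the chosen dimensions) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): sequence length S, embedding dimension E,
# per-head projection dimension P, number of heads H.
S, E, P, H = 32, 64, 32, 8

x = rng.standard_normal((S, E))       # input sequence
w_q = rng.standard_normal((H, E, P))  # per-head query projection weights
w_k = rng.standard_normal((H, E, P))  # per-head key projection weights


def classical_scores(x, w_q, w_k):
    """Attention scores Q K^T computed with two separate projections."""
    q = np.einsum("se,hep->hsp", x, w_q)
    k = np.einsum("se,hep->hsp", x, w_k)
    return np.einsum("hsp,htp->hst", q, k)


def fuse_qk_weights(w_q, w_k):
    """Offline step: fold W_q and W_k into one E x E matrix per head."""
    return np.einsum("hep,hfp->hef", w_q, w_k)


def fused_scores(x, w_fused):
    """Attention scores computed as (X W_fused) X^T, one projection per head."""
    xw = np.einsum("se,hef->hsf", x, w_fused)
    return np.einsum("hsf,tf->hst", xw, x)


w_fused = fuse_qk_weights(w_q, w_k)
scores_classical = classical_scores(x, w_q, w_k)
scores_fused = fused_scores(x, w_fused)

# Both schedules produce the same H x S x S score tensor.
assert np.allclose(scores_classical, scores_fused)
```

In this sketch the fused matrix has $E \times E$ entries per head instead of $2 \times E \times P$, so whether fusion saves parameters and operations depends on how $E$ compares with $P$ and $S$.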

Paper Structure (36 sections, 13 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Topology of the Transformer block, composed of an MHSA stage and a Fully-Connected stage. The dimensions of the tensors are indicated in red.
  • Figure 2: Overview of our Tiny Transformers deployment flow. The floating-point PyTorch model can be transformed and is then fed to QuantLib. Afterward, the quantized graph is ingested by the deployment frameworks enhanced with our library and optimizations. Finally, the generated C code is deployed on the desired platforms.
  • Figure 3: Diagram of the classical attention used in MHSA and the proposed Fused-Weight Attention. The dimension of each tensor is specified in red.
  • Figure 4: Number of parameters and operations as a function of the embedding dimension $E$, for $S = 32$, $P = 32$, and $H = 8$. The intersection points occur at $E=52$ and $E=64$ for the number of operations and the number of parameters, respectively.
  • Figure 5: Linear layer dataflow for generating Q, K, and V. The output data layout is $HPS$ and $HSP$. Matrices are filled from top left to bottom right.
  • ...and 7 more figures
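
The crossover values quoted in the Figure 4 caption ($E=52$ and $E=64$ for $S = 32$, $P = 32$) can be reproduced with a simple cost model. The model below is an assumption made for illustration, not taken from the paper: it counts only the Q/K projections and the attention-score product (the terms that differ between the two schedules), ignores biases, and counts one multiply-accumulate per scalar product term. The function names are hypothetical.

```python
def classical_params(E, P, H):
    # W_q and W_k: two E x P matrices per head.
    return H * 2 * E * P


def fused_params(E, H):
    # One fused E x E matrix per head.
    return H * E * E


def classical_macs(S, E, P, H):
    # Two S x E by E x P projections, then an S x P by P x S score product.
    return H * (2 * S * E * P + S * S * P)


def fused_macs(S, E, H):
    # One S x E by E x E projection, then an S x E by E x S score product.
    return H * (S * E * E + S * S * E)


def crossover(classical, fused, max_e=256):
    """Smallest integer E at which the fused cost reaches the classical cost."""
    for e in range(1, max_e + 1):
        if fused(e) >= classical(e):
            return e
    return None


S, P, H = 32, 32, 8
e_macs = crossover(lambda e: classical_macs(S, e, P, H),
                   lambda e: fused_macs(S, e, H))
e_params = crossover(lambda e: classical_params(e, P, H),
                     lambda e: fused_params(e, H))
print(e_macs, e_params)  # 52 64
```

Under this model the fused schedule needs fewer parameters when $E < 2P$, and fewer operations when $E^2 + SE < 2EP + SP$. The number of heads $H$ scales both sides equally and does not move the crossover.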