Deeploy: Enabling Energy-Efficient Deployment of Small Language Models On Heterogeneous Microcontrollers

Moritz Scherer, Luka Macan, Victor Jung, Philip Wiese, Luca Bompani, Alessio Burrello, Francesco Conti, Luca Benini

TL;DR

Deeploy tackles the challenge of end-to-end deployment of small language models on MCU-class devices by introducing a bottom-up compiler that co-optimizes tiling and memory allocation across a heterogeneous, multi-accelerator edge platform. The framework's Frontend/Midend/Backend architecture, along with memory-level annotations and a coupled tiling/memory scheduling strategy, enables efficient on-chip SLM inference with KV caching. Demonstrated on Siracusa, a multi-accelerator RISC-V MCU, the approach achieves 340 tokens per second at 490 μJ per token for a TinyStories-class Llama model, with strong data-movement and compilation-time characteristics. The results show notable gains in energy efficiency, throughput, and scalability compared to existing tinyML tools, highlighting Deeploy’s potential for broad deployment of embodied foundation models at the edge.

Abstract

With the rise of Embodied Foundation Models (EFMs), most notably Small Language Models (SLMs), adapting Transformers for edge applications has become a very active field of research. However, achieving end-to-end deployment of SLMs on microcontroller (MCU)-class chips without high-bandwidth off-chip main memory access is still an open challenge. In this paper, we demonstrate high-efficiency end-to-end SLM deployment on a multicore RISC-V (RV32) MCU augmented with ML instruction extensions and a hardware neural processing unit (NPU). To automate the exploration of the constrained, multi-dimensional memory vs. computation tradeoffs involved in aggressive SLM deployment on heterogeneous (multicore+NPU) resources, we introduce Deeploy, a novel Deep Neural Network (DNN) compiler, which generates highly-optimized C code requiring minimal runtime support. We demonstrate that Deeploy generates end-to-end code for executing SLMs, fully exploiting the RV32 cores' instruction extensions and the NPU: we achieve leading-edge energy and throughput of 490 μJ per token at 340 tokens per second for an SLM trained on the TinyStories dataset, running for the first time on an MCU-class device without external memory.
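
As a back-of-the-envelope reading of the two headline figures, they can be combined into an average power and a per-token latency. This assumes both numbers were measured at the same operating point, which the abstract does not state explicitly; the symbols $P_\text{avg}$, $E_\text{token}$, $r_\text{token}$, and $t_\text{token}$ are introduced here only for illustration.

```latex
\begin{align*}
P_\text{avg} &\approx E_\text{token} \cdot r_\text{token}
  = 490\,\mu\text{J/token} \times 340\,\text{token/s}
  \approx 167\,\text{mW}, \\
t_\text{token} &= \frac{1}{r_\text{token}}
  = \frac{1}{340\,\text{token/s}}
  \approx 2.9\,\text{ms}.
\end{align*}
```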

Paper Structure (30 sections, 4 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Overview of the Deeploy Execution Flow. Steps ① and ② are part of the Frontend. In the first step, the graph is modified by fusing and inserting platform-specific operators, for example, transposition operators, to match data layout requirements. In the second step, datatypes for every tensor are inferred, the accelerator target is chosen, and kernel templates are selected. The first step in the Midend, step ③, is the Tile Constraint Flow, which computes geometrical constraints for the tile sizes of each tensor and adds them to a constraint model. The resulting tensor size variables are translated into a 2D bin packing problem in step ④. The solution of the co-constrained tiling and static memory allocation problem is computed by the ORTools CP-SAT solver and finally processed in step ⑤ in the Backend. Step ⑤ generates platform-specific C code exploiting transfers. Each step of the execution flow is highly configurable through the Deployment Platform object.
  • Figure 2: Example of the algorithm that co-optimizes tiling and static memory allocation for one memory level in Deeploy. First, the lifetime of each tensor in the graph is calculated under the execution schedule shown on the left. Next, the memory scheduler constructs an adjacency matrix of the tensor graph and extracts the cost vector from the tile constraint flow shown in the middle. Finally, Deeploy applies a coordinate transform. The right-hand side shows the 2D bin packing result, with the naive solution on top and the solution found by Deeploy below.
  • Figure 3: Bottom-up offloading closure generation for a GEMV kernel. All arguments that refer to non-global Variable Buffers or Constant Buffers are captured and used to generate a closure struct typedef and a closure function that unpacks the argument struct and calls the original kernel. Finally, the kernel template is replaced with a call to pi_cl_team_fork, which takes the newly generated closure as an argument and offloads its execution to all eight cluster cores.
  • Figure 4: Overview of the Llama model deployed in this work. The eight decoder layers of the model are shown on the left and consist of an RMSNorm - Self-Attention - RMSNorm - Feed-Forward layer stack. Input ① in the self-attention inset corresponds to the token input. Input ② corresponds to the rotational embedding used in Llama models. Inputs ③ are the $KV$ cache inputs used during autoregressive inference. Notably, during autoregressive inference, the new rows of the $K$ and $V$ matrices computed for the input token are appended to the $KV$ cache.
  • Figure 5: Overview of Siracusa, featuring its octa-core RISC-V cluster with ML instruction extensions and host controller (red), NPU (orange), complex memory hierarchy with two levels of scratchpad memory and a Neural Memory Subsystem (blue), two arbitrated interconnects towards the L1 memory and an interconnect (green), and peripherals such as the cluster and chip-level I/O (purple).
  • ...and 3 more figures
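
The lifetime-based memory allocation described in the Figure 2 caption can be illustrated with a small sketch. Deeploy solves tiling and allocation jointly with the CP-SAT solver; the code below deliberately swaps in a much simpler greedy first-fit heuristic, covers only the allocation half, and assumes fixed tensor sizes. The names `tensor_t` and `allocate_first_fit` are hypothetical and not part of Deeploy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tensor record: size plus lifetime in schedule steps. */
typedef struct {
    size_t size;   /* bytes needed in this memory level */
    int first_use; /* schedule step that produces the tensor */
    int last_use;  /* last schedule step that reads it */
    size_t offset; /* output: assigned start address */
    int placed;    /* internal flag, reset by the allocator */
} tensor_t;

/* Two tensors conflict in time if their lifetimes share a schedule step. */
static int lifetimes_overlap(const tensor_t *a, const tensor_t *b) {
    return a->first_use <= b->last_use && b->first_use <= a->last_use;
}

/* Would tensor t, placed at offset, share addresses with other? */
static int address_clash(const tensor_t *t, size_t offset,
                         const tensor_t *other) {
    return offset < other->offset + other->size &&
           other->offset < offset + t->size;
}

/* Greedy first-fit: place tensors largest first, each at the lowest
 * offset that avoids every already placed tensor alive at the same time.
 * Returns the peak memory use in bytes. */
size_t allocate_first_fit(tensor_t *tensors, size_t count) {
    assert(tensors != NULL || count == 0);
    size_t peak = 0;
    for (size_t i = 0; i < count; ++i) {
        tensors[i].placed = 0;
    }
    for (size_t n = 0; n < count; ++n) {
        /* Pick the largest tensor that has not been placed yet. */
        size_t pick = count;
        for (size_t i = 0; i < count; ++i) {
            if (!tensors[i].placed &&
                (pick == count || tensors[i].size > tensors[pick].size)) {
                pick = i;
            }
        }
        tensor_t *t = &tensors[pick];
        /* Bump the candidate offset past each clash until none remains. */
        size_t offset = 0;
        int moved = 1;
        while (moved) {
            moved = 0;
            for (size_t j = 0; j < count; ++j) {
                const tensor_t *o = &tensors[j];
                if (o->placed && lifetimes_overlap(t, o) &&
                    address_clash(t, offset, o)) {
                    offset = o->offset + o->size;
                    moved = 1;
                }
            }
        }
        t->offset = offset;
        t->placed = 1;
        if (offset + t->size > peak) {
            peak = offset + t->size;
        }
    }
    return peak;
}
```

For three tensors of 4, 2, and 4 bytes with lifetimes [0, 1], [1, 2], and [2, 3], a naive allocation that never reuses memory needs 10 bytes, while this heuristic lets the first and third tensor share the same addresses and needs 6.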
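
The closure pattern from the Figure 3 caption can be sketched in portable C. This is not Deeploy's generated code: `team_fork_stub` and `core_id` are hypothetical host-side stand-ins for the platform runtime (the caption names pi_cl_team_fork, which is assumed here to take a core count, an entry function, and an argument pointer), and the kernel, struct, and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES 8 /* the caption mentions eight cluster cores */

/* Hypothetical stand-ins for the platform runtime. On the host, the
 * "cores" simply run one after another instead of in parallel. */
static int current_core = 0;
static int core_id(void) { return current_core; }
static void team_fork_stub(int num_cores, void (*entry)(void *), void *arg) {
    for (current_core = 0; current_core < num_cores; ++current_core) {
        entry(arg);
    }
    current_core = 0;
}

/* Original kernel: y = W * x, with the rows split across the cores. */
static void gemv_kernel(const int8_t *w, const int8_t *x, int32_t *y,
                        size_t rows, size_t cols) {
    size_t chunk = (rows + NUM_CORES - 1) / NUM_CORES;
    size_t begin = (size_t)core_id() * chunk;
    size_t end = begin + chunk < rows ? begin + chunk : rows;
    for (size_t r = begin; r < end; ++r) {
        int32_t acc = 0;
        for (size_t c = 0; c < cols; ++c) {
            acc += (int32_t)w[r * cols + c] * (int32_t)x[c];
        }
        y[r] = acc;
    }
}

/* Closure struct: one field per captured kernel argument. */
typedef struct {
    const int8_t *w;
    const int8_t *x;
    int32_t *y;
    size_t rows;
    size_t cols;
} gemv_closure_args_t;

/* Closure function: unpacks the struct and calls the original kernel. */
static void gemv_closure(void *raw) {
    gemv_closure_args_t *args = (gemv_closure_args_t *)raw;
    gemv_kernel(args->w, args->x, args->y, args->rows, args->cols);
}

/* Call site after the rewrite: pack the arguments and fork the closure. */
void gemv_offloaded(const int8_t *w, const int8_t *x, int32_t *y,
                    size_t rows, size_t cols) {
    assert(w != NULL && x != NULL && y != NULL);
    gemv_closure_args_t args = {w, x, y, rows, cols};
    team_fork_stub(NUM_CORES, gemv_closure, &args);
}
```

Packing the arguments into a single struct is what lets a fork call that accepts only one untyped pointer carry a kernel with an arbitrary signature.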
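
The KV cache update mentioned in the Figure 4 caption amounts to appending one new row to $K$ and one to $V$ per generated token. The following is a minimal sketch under assumed conditions: a single attention head, int8 data, and tiny fixed dimensions. `kv_cache_t`, `kv_cache_append`, `attention_scores`, `MAX_TOKENS`, and `HEAD_DIM` are hypothetical names, not taken from Deeploy or the deployed model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_TOKENS 4 /* illustrative context length */
#define HEAD_DIM 3   /* illustrative head dimension */

/* Statically sized cache for one attention head. */
typedef struct {
    int8_t k[MAX_TOKENS][HEAD_DIM];
    int8_t v[MAX_TOKENS][HEAD_DIM];
    size_t length; /* number of tokens cached so far */
} kv_cache_t;

/* Append the K and V rows computed for the newest token.
 * Returns 0 on success, -1 if the cache is already full. */
int kv_cache_append(kv_cache_t *cache, const int8_t *k_row,
                    const int8_t *v_row) {
    assert(cache != NULL && k_row != NULL && v_row != NULL);
    if (cache->length >= MAX_TOKENS) {
        return -1;
    }
    memcpy(cache->k[cache->length], k_row, HEAD_DIM);
    memcpy(cache->v[cache->length], v_row, HEAD_DIM);
    cache->length += 1;
    return 0;
}

/* Unnormalized attention scores of one query against all cached keys.
 * Only the new token's query is needed; earlier keys are read from the
 * cache rather than recomputed. */
void attention_scores(const kv_cache_t *cache, const int8_t *q,
                      int32_t *scores) {
    for (size_t t = 0; t < cache->length; ++t) {
        int32_t acc = 0;
        for (size_t d = 0; d < HEAD_DIM; ++d) {
            acc += (int32_t)q[d] * (int32_t)cache->k[t][d];
        }
        scores[t] = acc;
    }
}
```

A static capacity keeps the cache footprint known at compile time, which fits a setting where all memory is allocated statically.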