Deeploy: Enabling Energy-Efficient Deployment of Small Language Models On Heterogeneous Microcontrollers
Moritz Scherer, Luka Macan, Victor Jung, Philip Wiese, Luca Bompani, Alessio Burrello, Francesco Conti, Luca Benini
TL;DR
Deeploy tackles the challenge of end-to-end deployment of small language models on MCU-class devices by introducing a bottom-up compiler that co-optimizes tiling and memory allocation across a heterogeneous, multi-accelerator edge platform. The framework's Frontend/Midend/Backend architecture, along with memory-level annotations and a coupled tiling/memory scheduling strategy, enables efficient on-chip SLM inference with KV caching. Demonstrated on Siracusa, a multi-accelerator RISC-V MCU, the approach achieves 340 tokens per second at 490 μJ per token for a TinyStories-class Llama model, with strong data-movement and compilation-time characteristics. The results show notable gains in energy efficiency, throughput, and scalability compared to existing tinyML tools, highlighting Deeploy’s potential for broad deployment of embodied foundation models at the edge.
Abstract
With the rise of Embodied Foundation Models (EFMs), most notably Small Language Models (SLMs), adapting Transformers for edge applications has become a very active field of research. However, achieving end-to-end deployment of SLMs on microcontroller (MCU)-class chips without high-bandwidth off-chip main memory access is still an open challenge. In this paper, we demonstrate high-efficiency end-to-end SLM deployment on a multicore RISC-V (RV32) MCU augmented with ML instruction extensions and a hardware neural processing unit (NPU). To automate the exploration of the constrained, multi-dimensional memory vs. computation tradeoffs involved in aggressive SLM deployment on heterogeneous (multicore+NPU) resources, we introduce Deeploy, a novel Deep Neural Network (DNN) compiler, which generates highly-optimized C code requiring minimal runtime support. We demonstrate that Deeploy generates end-to-end code for executing SLMs, fully exploiting the RV32 cores' instruction extensions and the NPU: We achieve leading-edge energy and throughput of 490 μJ per token at 340 tokens per second for an SLM trained on the TinyStories dataset, running for the first time on an MCU-class device without external memory.
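As a quick sanity check on the two headline figures, the short Python sketch below derives the average power and per-token latency they imply. It assumes that the energy and throughput numbers describe the same steady-state operating point; the function names are illustrative and are not part of Deeploy.

```python
# Figures quoted in the abstract.
ENERGY_PER_TOKEN_J = 490e-6    # 490 uJ per token
THROUGHPUT_TOK_PER_S = 340.0   # 340 tokens per second


def average_power_w(energy_per_token_j: float, tokens_per_s: float) -> float:
    """Average power (W) = energy per token (J) x tokens per second."""
    return energy_per_token_j * tokens_per_s


def latency_per_token_s(tokens_per_s: float) -> float:
    """Mean time per generated token (s), the inverse of throughput."""
    return 1.0 / tokens_per_s


# Assumption: both figures refer to the same operating point.
power_mw = average_power_w(ENERGY_PER_TOKEN_J, THROUGHPUT_TOK_PER_S) * 1e3
latency_ms = latency_per_token_s(THROUGHPUT_TOK_PER_S) * 1e3

print(f"implied average power: {power_mw:.1f} mW")     # 166.6 mW
print(f"implied latency per token: {latency_ms:.2f} ms")  # 2.94 ms
```

Under that assumption, the quoted numbers correspond to roughly 167 mW of average power and about 2.9 ms per generated token.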
