Distributed Inference with Minimal Off-Chip Traffic for Transformers on Low-Power MCUs
Severin Bochem, Victor J. B. Jung, Arpan Prasad, Francesco Conti, Luca Benini
TL;DR
This work tackles the challenge of running Transformer-based models on memory-constrained edge devices by introducing a tensor-parallel, non-replicating partitioning scheme that distributes inference across a cluster of low-power MCUs (Siracusa). The method partitions MHSA across chips along the head dimension and slices the FFN layers without duplicating weights, requiring only two synchronization events per Transformer block and using hierarchical all-reduce to combine partial results, thereby minimizing off-chip memory accesses. Evaluations on TinyLlama-42M and MobileBERT show substantial improvements in latency and energy efficiency, including a 26.1x autoregressive speedup with 8 chips and a 60.1x speedup at 64 chips for scaled TinyLlama, as well as a 4.7x speedup for MobileBERT with 4 chips; energy per inference is reduced due to limited off-chip traffic. The results demonstrate scalable, real-time edge inference for wearable devices like smart glasses, enabling on-device intelligence without relying on heavy off-chip memory or cloud computing.
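To make the non-replicating idea concrete, the following toy sketch splits one FFN layer across chips so that each chip holds a disjoint slice of the weights and a single sum (the all-reduce) recovers the full output. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tiny matrix sizes, the ReLU activation, and the flat (non-hierarchical) reduction are all chosen here for brevity.

```python
# Toy sketch of non-replicating tensor parallelism for one FFN layer.
# Assumptions (not from the paper): ReLU activation, tiny hand-picked
# weights, and a flat sum standing in for the hierarchical all-reduce.

def matvec(x, w):
    """Multiply row vector x (length d_in) by matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def relu(v):
    return [max(0.0, a) for a in v]

def ffn(x, w1, w2):
    """Reference single-chip FFN: relu(x @ w1) @ w2."""
    return matvec(relu(matvec(x, w1)), w2)

def split_ffn(w1, w2, n_chips):
    """Give each chip a column slice of w1 and the matching row slice of w2.

    The slices are disjoint, so no weight is stored on more than one chip.
    Assumes the hidden dimension is divisible by n_chips.
    """
    step = len(w2) // n_chips
    shards = []
    for c in range(n_chips):
        lo, hi = c * step, (c + 1) * step
        shards.append(([row[lo:hi] for row in w1], w2[lo:hi]))
    return shards

def all_reduce_sum(partials):
    """Element-wise sum of the per-chip partial outputs."""
    return [sum(vals) for vals in zip(*partials)]

def distributed_ffn(x, shards):
    # Each chip computes a full-width partial output from its own slice;
    # the only communication is the final reduction.
    partials = [ffn(x, w1_c, w2_c) for w1_c, w2_c in shards]
    return all_reduce_sum(partials)

x = [1.0, -2.0]
w1 = [[1.0, 0.0, -1.0, 2.0],
      [0.0, 1.0, 1.0, -1.0]]
w2 = [[1.0, 2.0],
      [3.0, 4.0],
      [-1.0, 0.0],
      [2.0, 1.0]]

shards = split_ffn(w1, w2, n_chips=2)
reference = ffn(x, w1, w2)
distributed = distributed_ffn(x, shards)
print(reference, distributed)
```

Because the activation is element-wise, slicing the hidden dimension leaves each chip's computation independent until the final sum, which is why one synchronization per sliced layer group suffices in this sketch.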
Abstract
Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technology revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power Micro-Controller Units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic. The methodology exploits a distributed system of MCUs, partitioning inference across multiple devices and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. The distributed system achieves an energy consumption of 0.64 mJ and a latency of 0.54 ms per inference, with a super-linear speedup of 26.1x and an Energy Delay Product (EDP) improvement of 27.2x over a single-chip system. On MobileBERT, the distributed system's runtime is 38.8 ms, with a super-linear 4.7x speedup when using 4 MCUs compared to a single-chip system.
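As a quick sanity check on how the reported figures relate, the sketch below computes the EDP as energy times latency and factors the EDP improvement into its latency and energy parts. It assumes that the 26.1x speedup and the 27.2x EDP improvement refer to the same 8-chip configuration and inference mode as the 0.64 mJ and 0.54 ms figures; the variable names are illustrative, and the derived energy ratio is an implication of that assumption rather than a number reported in the abstract.

```python
# Reported figures from the abstract (8-chip TinyLlama-42M deployment).
energy_mj = 0.64      # energy per inference, mJ
latency_ms = 0.54     # latency per inference, ms
speedup = 26.1        # latency improvement over a single chip
edp_gain = 27.2       # EDP improvement over a single chip

# EDP = energy x delay; mJ x ms = 1e-6 J*s, i.e. microjoule-seconds.
edp_uj_s = energy_mj * latency_ms

# Since EDP gain = (energy ratio) x (latency ratio), the implied
# single-chip-to-distributed energy ratio is edp_gain / speedup.
# This is derived under the stated assumption, not reported.
energy_ratio = edp_gain / speedup

print(f"EDP: {edp_uj_s:.4f} uJ*s, implied energy ratio: {energy_ratio:.3f}")
```

Under that assumption, almost all of the EDP improvement comes from the latency reduction, with total energy per inference staying close to the single-chip value.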
