Distributed Inference with Minimal Off-Chip Traffic for Transformers on Low-Power MCUs

Severin Bochem, Victor J. B. Jung, Arpan Prasad, Francesco Conti, Luca Benini

TL;DR

This work tackles the challenge of running Transformer-based models on memory-constrained edge devices by introducing a tensor-parallel, non-replicating partitioning scheme that distributes inference across a cluster of low-power MCUs (Siracusa). The method partitions MHSA across chips along the head dimension and slices the FFN layers without duplicating weights, requiring only two synchronization events per Transformer block and using hierarchical all-reduce to combine partial results, thereby minimizing off-chip memory accesses. Evaluations on TinyLlama-42M and MobileBERT show substantial improvements in latency and energy efficiency, including a 26.1x autoregressive speedup with 8 chips and a 60.1x speedup at 64 chips for scaled TinyLlama, as well as a 4.7x speedup for MobileBERT with 4 chips; energy per inference is reduced due to limited off-chip traffic. The results demonstrate scalable, real-time edge inference for wearable devices like smart glasses, enabling on-device intelligence without relying on heavy off-chip memory or cloud computing.
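The non-replicating FFN slicing can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a two-layer FFN with a ReLU activation, small integer toy weights, and hypothetical helper names (`split_ffn`, `ffn_partitioned`). The first weight matrix is split by columns and the second by rows, so each chip holds a disjoint slice, and one element-wise sum (standing in for the all-reduce) recombines the partial outputs.

```python
import random

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def relu(v):
    # ReLU is an assumption; any element-wise activation splits the same way.
    return [max(0, a) for a in v]

def ffn_reference(x, W1, W2):
    """Unpartitioned FFN on a single chip."""
    return matvec(relu(matvec(x, W1)), W2)

def split_ffn(W1, W2, n_chips):
    """Give each chip a column slice of W1 and the matching row slice of W2."""
    hidden = len(W2)
    assert hidden % n_chips == 0, "hidden size must divide evenly across chips"
    step = hidden // n_chips
    shards = []
    for c in range(n_chips):
        lo, hi = c * step, (c + 1) * step
        W1_c = [row[lo:hi] for row in W1]  # columns lo..hi of W1
        W2_c = W2[lo:hi]                   # rows lo..hi of W2
        shards.append((W1_c, W2_c))
    return shards

def ffn_partitioned(x, shards):
    """Each chip computes a partial output; a sum plays the role of the all-reduce."""
    partials = [matvec(relu(matvec(x, W1_c)), W2_c) for W1_c, W2_c in shards]
    return [sum(vals) for vals in zip(*partials)]

# Toy sizes (hypothetical): model dimension 4, hidden dimension 8, 4 chips.
random.seed(0)
d_model, d_hidden = 4, 8
W1 = [[random.randint(-3, 3) for _ in range(d_hidden)] for _ in range(d_model)]
W2 = [[random.randint(-3, 3) for _ in range(d_model)] for _ in range(d_hidden)]
x = [random.randint(-3, 3) for _ in range(d_model)]
print(ffn_partitioned(x, split_ffn(W1, W2, 4)) == ffn_reference(x, W1, W2))
```

Because the activation is element-wise, the column slices of the hidden layer never need to be exchanged between chips; only the final partial sums are communicated, which matches the single synchronization point this sketch models for the FFN.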

Abstract

Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technology revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power Micro-Controller Units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic by exploiting a distributed system of MCUs, partitioning inference across multiple devices and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. The distributed system achieves an energy consumption of 0.64 mJ, a latency of 0.54 ms per inference, a super-linear speedup of 26.1x, and an Energy Delay Product (EDP) improvement of 27.2x, compared to a single-chip system. On MobileBERT, the distributed system's runtime is 38.8 ms, with a super-linear 4.7x speedup when using 4 MCUs compared to a single-chip system.
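The "super-linear" label and the EDP figure can be checked with simple arithmetic on the numbers quoted above. The sketch below assumes the usual definition of EDP as energy multiplied by latency, and takes speedup to mean single-chip latency divided by multi-chip latency; the function names are illustrative only.

```python
def edp(energy_mj, latency_ms):
    """Energy Delay Product in mJ*ms (assumed definition: energy times delay)."""
    return energy_mj * latency_ms

def parallel_efficiency(speedup, n_chips):
    """Speedup per chip; a value above 1.0 means super-linear scaling."""
    return speedup / n_chips

# Values quoted in the abstract.
tinyllama_edp = edp(0.64, 0.54)                 # 8-chip TinyLlama-42M
tinyllama_eff = parallel_efficiency(26.1, 8)    # 8 chips
mobilebert_eff = parallel_efficiency(4.7, 4)    # 4 chips

print(f"TinyLlama EDP: {tinyllama_edp:.4f} mJ*ms")
print(f"TinyLlama speedup per chip: {tinyllama_eff:.2f}")
print(f"MobileBERT speedup per chip: {mobilebert_eff:.2f}")
```

Both per-chip ratios exceed 1.0, which is what distinguishes super-linear from merely linear scaling: adding chips removes off-chip memory traffic in addition to dividing the compute.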

Paper Structure

This paper contains 14 sections, 4 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Hierarchical interconnection of the Siracusa chips in the proposed system. Chips are placed in groups of four for improved scalability of the system. We use for the chip-to-chip link.
  • Figure 2: Overview of the generic Siracusa including an octa-core RISC-V cluster and host controller (red), memory hierarchy with two levels of scratchpad memory, two arbitrated interconnects towards the L1 memory and an interconnect (green), and peripherals such as the cluster and chip-level I/O (purple). Note that the image does not depict the N-EUREKA accelerator, as it was not used in this work.
  • Figure 3: Partitioning of Transformer Inference for three chips. Tensor colorings indicate on which chip a tensor is present. Tensors with grey coloring are present in all chips. Softmax and Norm are shown in orange.
  • Figure 4: Results of the MobileBERT model and TinyLlama model in prompt and autoregressive modes. The lines indicate the speedup when using 1 to 8 chips for TinyLlama or 1 to 4 chips for MobileBERT compared to a single-chip system. The bar plot shows the breakdown of runtime into computation, chip-to-chip communication, and access to L2 and L3 memory.
  • Figure 5: This figure depicts runtimes and energies of TinyLlama in autoregressive mode (left), TinyLlama in prompt mode (middle), and MobileBERT (right). Red crosses are results obtained for models in their default configuration, whereas red circles show results for the scaled-up models.
  • ...and 1 more figures
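
The grouping of chips in fours (Figure 1) together with the hierarchical all-reduce mentioned in the TL;DR can be sketched as follows. The schedule shown here, reducing within each group, then across groups, then broadcasting the total back, is an assumption for illustration; the summary does not specify the exact protocol, and `hierarchical_all_reduce` is a hypothetical name.

```python
def hierarchical_all_reduce(partials, group_size=4):
    """Sum equal-length vectors, one per chip, through a tree of groups.

    Returns (per_chip_results, reduce_levels). Groups of four follow the
    Figure 1 caption; the reduce-then-broadcast schedule is assumed.
    """
    level = [list(p) for p in partials]
    reduce_levels = 0
    # Reduce phase: each group of up to `group_size` vectors collapses into one.
    while len(level) > 1:
        level = [
            [sum(vals) for vals in zip(*level[i:i + group_size])]
            for i in range(0, len(level), group_size)
        ]
        reduce_levels += 1
    total = level[0]
    # Broadcast phase: every chip ends up with the same reduced vector.
    return [list(total) for _ in partials], reduce_levels

# Eight chips, each holding a two-element partial result.
results, levels = hierarchical_all_reduce([[c, 2 * c] for c in range(8)])
print(results[0], levels)
```

With groups of four, the number of reduce levels grows logarithmically (base 4) in the chip count, which is one way such a hierarchy keeps communication cost manageable as the system scales.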