Toward Attention-based TinyML: A Heterogeneous Accelerated Architecture and Automated Deployment Flow
Philip Wiese, Gamze İslamoğlu, Moritz Scherer, Luka Macan, Victor J. B. Jung, Alessio Burrello, Francesco Conti, Luca Benini
TL;DR
The paper addresses Attention-based Transformer inference on ultra-low-power edge devices. It introduces a flexible hardware-software template that couples a latency-tolerant RISC-V Snitch compute cluster with a dedicated Transformer accelerator (ITA) over shared L1 memory, together with a bottom-up deployment flow built on Deeploy. The authors design and implement the ITA, integrate it into the hardware template, and extend Deeploy to map Transformer workloads efficiently, including head-by-head tiling and double-buffered dataflows. They demonstrate end-to-end 8-bit Transformer inference on MobileBERT, DINOv2, and the Whisper encoder, achieving up to 154 GOp/s throughput and 2.96 TOp/J energy efficiency, a substantial improvement over a baseline cluster and competitive with the state of the art. The work advances practical TinyML by enabling high-throughput, energy-efficient Transformer inference at the edge with automated deployment, broad model compatibility, and scalable hardware-software co-design.
Abstract
One of the challenges for Tiny Machine Learning (tinyML) is keeping up with the evolution of Machine Learning models from Convolutional Neural Networks to Transformers. We address this by leveraging a heterogeneous architectural template coupling RISC-V processors with hardwired accelerators supported by an automated deployment flow. We demonstrate Attention-based models in a tinyML power envelope with an octa-core cluster coupled with an accelerator for quantized Attention. Our deployment flow enables end-to-end 8-bit Transformer inference, achieving leading-edge energy efficiency and throughput of 2960 GOp/J and 154 GOp/s (0.65 V, 22 nm FD-SOI technology).
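To put the two headline figures in context, the short sketch below derives the power and per-operation energy they imply. It assumes, which the abstract does not state explicitly, that the 154 GOp/s throughput and the 2960 GOp/J efficiency refer to the same operating point; the function name is illustrative and not part of the authors' tooling.

```python
def implied_power_mw(throughput_gops: float, efficiency_gop_per_j: float) -> float:
    """Power in mW implied by throughput (GOp/s) divided by efficiency (GOp/J).

    Assumption: both figures were measured at the same operating point.
    """
    watts = throughput_gops / efficiency_gop_per_j  # (GOp/s) / (GOp/J) = J/s = W
    return watts * 1e3


def energy_per_op_pj(efficiency_gop_per_j: float) -> float:
    """Energy per operation in pJ for a given efficiency in GOp/J."""
    joules_per_op = 1.0 / (efficiency_gop_per_j * 1e9)
    return joules_per_op * 1e12


power_mw = implied_power_mw(154.0, 2960.0)
pj_per_op = energy_per_op_pj(2960.0)
print(f"implied power: {power_mw:.1f} mW")       # implied power: 52.0 mW
print(f"energy per op: {pj_per_op:.3f} pJ/Op")   # energy per op: 0.338 pJ/Op
```

Under that assumption, the figures correspond to roughly 52 mW, which is consistent with the tinyML power envelope the abstract refers to.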
