Toward Attention-based TinyML: A Heterogeneous Accelerated Architecture and Automated Deployment Flow

Philip Wiese, Gamze İslamoğlu, Moritz Scherer, Luka Macan, Victor J. B. Jung, Alessio Burrello, Francesco Conti, Luca Benini

TL;DR

The paper enables Attention-based Transformer inference on ultra-low-power edge devices by introducing a flexible hardware-software template that couples a latency-tolerant RISC-V Snitch compute cluster with a dedicated Transformer accelerator (ITA) over shared L1 memory, together with a bottom-up deployment flow based on Deeploy. The authors design and implement ITA, integrate it into the hardware template, and extend Deeploy to map Transformer workloads efficiently, including head-by-head tiling and double-buffered dataflows. They demonstrate end-to-end 8-bit Transformer inference on MobileBERT, DINOv2, and the Whisper encoder, achieving a throughput of up to 154 GOp/s and an energy efficiency of up to 2.96 TOp/J (2960 GOp/J), with substantial improvements over a baseline cluster without the accelerator and results competitive with the state of the art. The work advances practical TinyML by enabling high-throughput, energy-efficient Transformer inference at the edge with automated deployment, broad model compatibility, and scalable hardware-software co-design.

Abstract

One of the challenges for Tiny Machine Learning (tinyML) is keeping up with the evolution of Machine Learning models from Convolutional Neural Networks to Transformers. We address this by leveraging a heterogeneous architectural template coupling RISC-V processors with hardwired accelerators supported by an automated deployment flow. We demonstrate Attention-based models in a tinyML power envelope with an octa-core cluster coupled with an accelerator for quantized Attention. Our deployment flow enables end-to-end 8-bit Transformer inference, achieving leading-edge energy efficiency and throughput of 2960 GOp/J and 154 GOp/s (0.65 V, 22 nm FD-SOI technology).

Paper Structure (17 sections, 2 equations, 2 figures, 1 table)

This paper contains 17 sections, 2 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Overview of the Hardware-Software Architecture Template. The flexible template allows modular integration of accelerators and deployment of different workloads with Deeploy. The workflow is as follows: (1) Integrate an accelerator as an engine, a configurable interface designed for efficient integration of memory-coupled accelerators, enabling streamlined data transfer and control between the accelerator and shared memory. (2) Ensure sufficient bandwidth for the accelerator by tuning the wide interconnect, which provides high-bandwidth access to L2 memory. (3) Configure the operator mapping in Deeploy and provide the workload as a graph. (4) Define the tiling constraints according to the accelerator buffer and datapath sizes, and provide minimal kernel templates to control the accelerator via a register interface. (5) Use Deeploy to perform automated graph optimization and scheduling, to co-optimize operator tiling and static memory allocation, and to generate C code. This code orchestrates memory transfers and coordinates execution on the compute cores and the accelerator.
  • Figure 2: Architecture of the Integer Transformer Accelerator (ITA). ITA combines an output-stationary dataflow with a local weight-stationary dataflow and a streaming Softmax operation to achieve high data reuse and minimal memory interaction. Weights are stored in a double-buffered weight memory, so the next set of weights can be fetched while computation proceeds with the current set. Inputs are fetched via streamers and passed through the ITAMax module during the $\mathbf{A} \times \mathbf{V}$ step. While $\mathbf{Q} \times \mathbf{K}^\mathrm{T}$ is computed, the ITAMax module operates on the outputs to accumulate the Softmax denominator. ITAMax operates in three stages. First, it finds the local maximum, compares it with the previous maximum stored in the buffer, accumulates the Softmax denominator using the current maximum, and renormalizes the previous sum if the maximum has changed. Second, after the accumulation, the denominator is inverted and saved to the same buffer. Third, the inputs for the $\mathbf{A} \times \mathbf{V}$ step are normalized using the saved maximum and the inverted denominator.
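
The three ITAMax stages described in the Figure 2 caption follow the general pattern of a streaming (online) Softmax. The Python sketch below illustrates that pattern only; it is not ITA's implementation. It uses floating-point arithmetic for readability, whereas ITA operates on quantized integer data, and the function name `streaming_softmax` and the `chunk_size` parameter are hypothetical names chosen for this example.

```python
import math


def streaming_softmax(row, chunk_size=4):
    """Softmax over `row`, consuming the inputs chunk by chunk.

    Illustrative floating-point sketch; names and chunking are assumptions,
    not taken from the paper.
    """
    if not row:
        raise ValueError("row must not be empty")

    running_max = -math.inf  # maximum seen so far (the "buffer" value)
    denom = 0.0              # Softmax denominator relative to running_max

    # Stage 1: accumulate the denominator while tracking the running maximum.
    for start in range(0, len(row), chunk_size):
        chunk = row[start:start + chunk_size]
        local_max = max(chunk)
        if local_max > running_max:
            # The maximum changed: rescale the previous sum to the new maximum.
            denom *= math.exp(running_max - local_max)
            running_max = local_max
        denom += sum(math.exp(x - running_max) for x in chunk)

    # Stage 2: invert the denominator once and keep it for reuse.
    inv_denom = 1.0 / denom

    # Stage 3: normalize each element with the saved maximum and inverse.
    return [math.exp(x - running_max) * inv_denom for x in row]
```

Because the previous sum is rescaled whenever a larger maximum appears, the denominator can be accumulated in a single pass over the scores as they are produced, and the normalization can be deferred until the values are consumed again.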