Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

Viviane Potocnik, Luca Colagrande, Tim Fischer, Luca Bertaccini, Daniele Jahier Pagliari, Alessio Burrello, Luca Benini

TL;DR

This work demonstrates an end-to-end, open-source deployment of transformer-based foundation-model inference on a many-tiny-core RISC-V platform, incorporating distributed Softmax, ISA-driven SIMD operand streaming, and DMA-accelerated dataflow. By combining FlashAttention-2-inspired kernels, layer fusion, hierarchical interconnects, and double-buffering, the authors achieve substantial speedups and high FPU utilization across encoder- and decoder-only models, including up to $12.8\times$ (encoder) and $16.1$–$35.6\times$ (decoder) improvements over baselines, with a peak FPU utilization over $79\%$ and $294\,\mathrm{GFLOPS/W}$. The open-source library supports FP64/FP32/FP16/FP8 precisions and demonstrates scalability across sequence lengths and cluster counts, delivering competitive or superior efficiency compared to state-of-the-art accelerators on similar tasks. These results highlight the practicality and impact of open, configurable hardware-software stacks for on-edge transformer inference, enabling flexible AI workloads beyond proprietary accelerators.

Abstract

Transformer-based foundation models have become crucial for various domains, most notably natural language processing (NLP) and computer vision (CV). These models are predominantly deployed on high-performance GPUs or hardwired accelerators with highly customized, proprietary instruction sets. Until now, limited attention has been given to RISC-V-based general-purpose platforms. In our work, we present the first end-to-end inference results of transformer models on an open-source many-tiny-core RISC-V platform implementing distributed Softmax primitives and leveraging ISA extensions for SIMD floating-point operand streaming and instruction repetition, as well as specialized DMA engines to minimize costly main memory accesses and to tolerate their latency. We focus on two foundational transformer topologies, encoder-only and decoder-only models. For encoder-only models, we demonstrate a speedup of up to 12.8x between the most optimized implementation and the baseline version. We reach over 79% FPU utilization and 294 GFLOPS/W, outperforming State-of-the-Art (SoA) accelerators by more than 2x in HW platform utilization while achieving comparable throughput per computational unit. For decoder-only topologies, we achieve 16.1x speedup in the Non-Autoregressive (NAR) mode and up to 35.6x speedup in the Autoregressive (AR) mode compared to the baseline implementation. Compared to the best SoA dedicated accelerator, we achieve 2.04x higher FPU utilization.
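The abstract mentions distributed Softmax primitives without detailing them. As a rough illustration only, the sketch below shows one common way to split a Softmax row across workers: each worker computes a local maximum and a local exponential sum, and the partial results are then merged by rescaling to the global maximum. The function names (`local_stats`, `merge_stats`, `distributed_softmax`) and the chunking scheme are assumptions made for this example and are not taken from the authors' library.

```python
import math

def local_stats(chunk):
    """Per-worker step: local maximum and sum of exponentials shifted by it."""
    m = max(chunk)
    s = sum(math.exp(x - m) for x in chunk)
    return m, s

def merge_stats(stats):
    """Reduction step: rescale every local sum to the global maximum."""
    g = max(m for m, _ in stats)
    total = sum(s * math.exp(m - g) for m, s in stats)
    return g, total

def distributed_softmax(row, num_parts):
    """Softmax of a non-empty row, computed from per-chunk partial results.

    The chunks stand in for the slices that separate workers would hold
    (hypothetical partitioning, chosen here for illustration).
    """
    size = -(-len(row) // num_parts)  # ceiling division
    chunks = [row[i:i + size] for i in range(0, len(row), size)]
    g, total = merge_stats([local_stats(c) for c in chunks])
    return [math.exp(x - g) / total for c in chunks for x in c]
```

Because every local sum is rescaled by `exp(m - g)` before being added, the merged result should match a single-pass Softmax up to floating-point rounding, while each worker only ever touches its own slice of the row.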

Paper Structure (28 sections, 2 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Block topology of the basic Attention layer. Each arrow is annotated with the share of memory transfers needed for that tensor, as a percentage of the total number of transfers needed by the block, considering the GPT-J model and a sequence length of 2048. Red dots mark data that our implementation reads from HBM; green dots mark transfers done only at the cluster level.
  • Figure 2: Operations performed by the fundamental ViT and GPT blocks.
  • Figure 3: Architecture of the RISC-V compute cluster with the ISA extensions Xfrep and Xssr.
  • Figure 4: Scalable multi-cluster architecture with hierarchical heterogeneous memory interconnect.
  • Figure 5: Illustration of the spatio-temporal GEMM tiling. A) Spatial tiling of the operation on the $M$ and $K$ dimensions. When tiling $M$, each cluster processes distinct rows of the output matrix: matrices $A$ and $C$ are partitioned into blocks, while matrix $B$ is broadcast. When tiling $K$, matrices $A$ and $B$ are partitioned, and each cluster produces a partial $C$ matrix; the partial matrices are then summed together. B) Spatial tiling on the $M$ dimension combined with temporal tiling on the $K$ dimension, where $t_0$, $t_1$, ..., $t_E$ denote different time steps. Note that at each individual time step, only a single temporal tile (in the figure, green, red, and yellow rectangles) is loaded in the cluster memory.
  • ...and 5 more figures
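
The tiling scheme described in the Figure 5 caption (spatial tiling on $M$ combined with temporal tiling on $K$) can be sketched in plain Python. This is a minimal, sequential illustration under stated assumptions: the names `split`, `matmul`, and `tiled_gemm`, as well as the parameters `num_clusters` and `k_tile`, are hypothetical and do not come from the paper's library, and the outer loop merely emulates work that the clusters would perform in parallel.

```python
def matmul(a, b):
    """Reference GEMM on nested lists: (M x K) @ (K x N) -> (M x N)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def split(n, parts):
    """Split range(n) into `parts` contiguous, near-equal (start, stop) chunks."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def tiled_gemm(a, b, num_clusters, k_tile):
    """C = A @ B with spatial tiling on M and temporal tiling on K."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    # Spatial tiling: each (emulated) cluster owns a distinct block of rows.
    for m0, m1 in split(m, num_clusters):
        # Temporal tiling: only one K tile is "resident" per time step.
        for k0 in range(0, k, k_tile):
            k1 = min(k0 + k_tile, k)
            a_tile = [row[k0:k1] for row in a[m0:m1]]
            b_tile = b[k0:k1]  # the same B tile is used by every cluster
            partial = matmul(a_tile, b_tile)
            # Accumulate the partial result into this cluster's block of C.
            for i, prow in enumerate(partial):
                for j, value in enumerate(prow):
                    c[m0 + i][j] += value
    return c
```

For example, `tiled_gemm(a, b, num_clusters=2, k_tile=2)` walks over two row blocks and, within each, accumulates one partial product per $K$ tile, which mirrors the time steps $t_0$, $t_1$, ... in the caption. A real implementation would additionally overlap the loading of the next tile with computation on the current one.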