A Flexible Template for Edge Generative AI with High-Accuracy Accelerated Softmax & GELU

Andrea Belano; Yvan Tortorella; Angelo Garofalo; Luca Benini; Davide Rossi; Francesco Conti

TL;DR

This work introduces an edge-oriented GenAI acceleration template that couples an 8-core RISC-V PULP cluster with a 24×8 BF16 tensor unit and SoftEx, a novel accelerator for softmax and GELU. By replacing expensive exponentiation with the hardware-friendly expp approximation and offloading the non-linearities to SoftEx, the design reaches up to 310 GOPS at 0.8 V and 1.34 TOPS/W at 0.55 V on ViT base, mitigating the softmax/GELU bottlenecks that limit end-to-end Transformer throughput. Accuracy assessments show that the expp and GELU approximations preserve model performance on MobileBERT, ViT, and GPT-2 benchmarks, with negligible degradation when GELU uses 4–5 terms and 14-bit accumulators. The template also scales competitively in mesh configurations, offering a practical path to unquantized, high-accuracy edge GenAI inference without retraining or fine-tuning.

Abstract

Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need for sophisticated non-linearities such as softmax and GELU. Although Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially when dedicated hardware accelerates the MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256 KiB of shared SRAM, 8 general-purpose RISC-V cores, a 24x8 systolic array MatMul accelerator, and SoftEx, a novel accelerator for the Transformer softmax and GELU non-linearities. SoftEx introduces an approximate exponentiation algorithm that balances efficiency (121x speedup over glibc's implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm$^2$, only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx accelerates softmax and GELU computations by up to 10.8x and 5.11x, respectively, while reducing their energy consumption by up to 10.8x and 5.29x. These enhancements translate into a 1.58x increase in throughput (310 GOPS at 0.8 V) and a 1.42x improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.
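To make the idea of trading exponentiation accuracy for speed concrete, the following is a minimal Python sketch of a classic Schraudolph-style bit trick, plugged into a numerically stable softmax. It is an illustration only: the abstract does not describe how SoftEx's approximation works internally, so the function names (fast_exp, softmax) and the float32 bit layout used here are assumptions, and this uncorrected variant has a relative error of up to roughly 6%, far coarser than the 0.14% mean relative error reported above.

```python
import math
import struct

def fast_exp(x: float) -> float:
    """Schraudolph-style exp(x) approximation (illustrative, NOT the paper's algorithm).

    Writes x / ln(2) directly into the exponent and mantissa fields of an
    IEEE 754 float32, so 2**frac is approximated linearly by (1 + frac).
    """
    scale = (1 << 23) / math.log(2.0)  # moves the integer part of x/ln2 into the exponent field
    bias = 127 * (1 << 23)             # float32 exponent bias, shifted into place
    v = scale * x + bias
    # Clamp to the bit patterns of non-negative finite float32 values.
    v = min(max(v, 0.0), float(0x7F7FFFFF))
    return struct.unpack("<f", struct.pack("<I", int(v)))[0]

def softmax(xs, exp_fn=math.exp):
    """Numerically stable softmax with a pluggable exponential."""
    m = max(xs)                         # subtract the max so every exponent is <= 0
    es = [exp_fn(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

if __name__ == "__main__":
    scores = [1.0, 2.0, 3.0]
    print("exact :", softmax(scores))
    print("approx:", softmax(scores, fast_exp))
```

Because softmax normalizes by the sum of the exponentials, part of the approximation error cancels between numerator and denominator, which is one reason a cheap exponential can be tolerable in this setting.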
