A Flexible Template for Edge Generative AI with High-Accuracy Accelerated Softmax & GELU

Andrea Belano, Yvan Tortorella, Angelo Garofalo, Luca Benini, Davide Rossi, Francesco Conti

TL;DR

This work introduces an edge-oriented GenAI acceleration template that couples an 8-core RISC-V PULP cluster with a 24×8 BF16 tensor unit and SoftEx, a novel accelerator for softmax and GELU. By replacing expensive exponentiation with a hardware-friendly approximation (expp) and offloading the non-linearities to SoftEx, the design achieves up to 310 GOPS at 0.8 V and 1.34 TOPS/W at 0.55 V on ViT base, mitigating the softmax/GELU bottlenecks that limit end-to-end Transformer throughput. Accuracy assessments show that the expp and GELU approximations maintain model performance on MobileBERT, ViT, and GPT-2 benchmarks, with negligible degradation when using 4–5 terms for GELU and 14-bit accumulators. Compared with computing the non-linearities in software on the RISC-V cores, the template delivers a 1.58x throughput gain and a 1.42x energy-efficiency gain, and it scales competitively in mesh configurations, offering a practical path to unquantized, high-accuracy edge GenAI inference without re-training or fine-tuning.

Abstract

Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need for sophisticated non-linearities such as softmax and GELU. Even though Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially when dedicated hardware is used to accelerate MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256 KiB of shared SRAM, 8 general-purpose RISC-V cores, a 24x8 systolic array MatMul accelerator, and a novel accelerator for Transformer softmax and GELU non-linearities: SoftEx. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (121x speedup over glibc's implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm$^2$, only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx achieves significant improvements, accelerating softmax and GELU computations by up to 10.8x and 5.11x, respectively, while reducing their energy consumption by up to 10.8x and 5.29x. These enhancements translate into a 1.58x increase in throughput (310 GOPS at 0.8 V) and a 1.42x improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.
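
To give a feel for what an approximate exponentiation inside a softmax can look like, here is a minimal Python sketch. It is not the paper's expp algorithm: it uses the well-known Schraudolph-style trick of writing a scaled input directly into the bit pattern of a float, without the mantissa correction the paper adds, so its relative error is on the order of a few percent rather than the 0.14% reported above. The names `fast_exp`, `softmax_fast`, and `SHIFT_C` are hypothetical, the tuning constant is an assumption, and float32 is used instead of BFloat16 for simplicity.

```python
import math
import struct

LN2 = math.log(2.0)
MANT_BITS = 23                 # float32 mantissa width (BFloat16 would use 7)
BIAS = 127 << MANT_BITS        # IEEE-754 exponent bias, shifted into the exponent field
SHIFT_C = 486411               # assumed tuning constant, not taken from the paper


def fast_exp(x: float) -> float:
    """Approximate exp(x) by building the float32 bit pattern directly."""
    # Scaling x / ln(2) by 2^23 places the integer part of log2(e^x) in the
    # exponent field and the fractional part in the mantissa, which amounts
    # to a piecewise-linear interpolation of 2^f between powers of two.
    i = int(x * (1 << MANT_BITS) / LN2) + BIAS - SHIFT_C
    i = max(0, min(i, 0x7F7FFFFF))  # clamp to [+0.0, largest finite float32]
    return struct.unpack("<f", struct.pack("<I", i))[0]


def softmax_fast(scores):
    """Softmax using fast_exp, with the usual max subtraction for stability."""
    m = max(scores)
    exps = [fast_exp(s - m) for s in scores]  # all arguments are <= 0
    total = sum(exps)                         # accumulate the denominator
    return [e / total for e in exps]          # normalize


if __name__ == "__main__":
    xs = [k / 100.0 for k in range(-1000, 1)]
    mean_rel = sum(abs(fast_exp(x) - math.exp(x)) / math.exp(x) for x in xs) / len(xs)
    print(f"mean relative error of the uncorrected sketch: {mean_rel:.4f}")
    print(softmax_fast([1.0, 2.0, 3.0]))
```

Because softmax divides by the sum of the same approximated exponentials, part of the approximation error cancels in the normalization, which is one reason such approximations can be tolerable in attention layers.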

Paper Structure

This paper contains 40 sections, 19 equations, 15 figures, 2 tables, 2 algorithms.

Figures (15)

  • Figure 1: Runtime breakdown of one ViT layer running on an 8-core PULP cluster enhanced with tensor processing units of various dimensions.
  • Figure 2: The circuit implementing the proposed correction to the exponential approximation, assuming a 7-bit mantissa in BFloat16.
  • Figure 3: Architecture of the enhanced PULP cluster proposed in this work. External connections are not shown for simplicity.
  • Figure 4: A detailed view of SoftEx and its datapath. In the left image, the paths used in the calculation of softmax are highlighted: paths used in the accumulation step are shown in blue, those used in the normalization step in red, and those used in both steps in purple. In the right image, paths used in the sum-of-exponentials calculation are highlighted in orange. Paths unused in a given mode are grayed out.
  • Figure 5: The effects of changing the number of bits in the lane accumulators and number of terms in the sum of exponentials. From left to right: the number of mismatches in the predicted labels and the mean squared error (MSE) of the output logits of ViT on ImageNet1k, and the perplexity of GPT-2 on the WikiText-2 dataset. For ViT, the number of mismatches and the MSE are defined with respect to a model in which both the exponential function and GELU are computed using accurate methods. The dashed red line on the rightmost plot represents the perplexity of GPT-2 when both the exponential function and GELU are calculated with accurate methods.
  • ...and 10 more figures