VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers
Run Wang, Gamze Islamoglu, Andrea Belano, Viviane Potocnik, Francesco Conti, Angelo Garofalo, Luca Benini
TL;DR
This paper targets the Softmax bottleneck in Transformer attention by introducing a low-cost BF16 exponential unit, based on a Schraudolph-inspired approximation, and integrating it into the RISC-V Snitch cluster as a lightweight EXP ISA extension. The hardware/software co-design accelerates Softmax and related kernels, notably FlashAttention-2, achieving substantial latency and energy improvements without retraining and with negligible accuracy loss. Key contributions include the EXP custom arithmetic block, the ExpOpGroup with FEXP/VFEXP instructions, and optimized Softmax and FlashAttention-2 kernels, yielding up to 162.7× lower Softmax latency and up to 5.8× lower end-to-end latency on a multi-cluster system running pre-trained Transformers. The approach demonstrates scalable, end-to-end Transformer inference on resource-constrained hardware with modest area and power overhead, highlighting the practicality of RISC-V-based AI accelerators.
Abstract
While Transformers are dominated by Floating-Point (FP) Matrix-Multiplications, their aggressive acceleration through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions like Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method, and we integrate it into the Floating-Point Unit (FPU) of the RISC-V cores of a compute cluster, through custom Instruction Set Architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to leverage the extension, we execute Softmax with 162.7× less latency and 74.3× less energy compared to the baseline cluster, achieving an 8.2× performance improvement and 4.1× higher energy efficiency for the FlashAttention-2 kernel in GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3 and ViT, achieving up to 5.8× and 3.6× reduction in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss.
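To illustrate the idea behind a Schraudolph-style exponential, the following is a minimal Python sketch of the classic form of the trick adapted to the Bfloat16 bit layout (1 sign bit, 8 exponent bits, 7 mantissa bits). It is not the paper's refined algorithm or its hardware datapath: it omits any mantissa correction, and the names `schraudolph_exp_bf16` and `softmax_approx` are hypothetical helpers introduced only for this example. The sketch scales the input by 2^7 / ln 2, adds the biased-exponent offset 127 · 2^7, and reinterprets the resulting integer as a Bfloat16 bit pattern, so the integer part lands in the exponent field and the fractional part in the mantissa field.

```python
import math
import struct

BF16_MANT_BITS = 7                              # Bfloat16 mantissa width
BF16_BIAS = 127                                 # Bfloat16 exponent bias (same as FP32)
SCALE = (1 << BF16_MANT_BITS) / math.log(2.0)   # 2^7 / ln 2
OFFSET = BF16_BIAS << BF16_MANT_BITS            # 127 * 2^7 = 16256
MAX_FINITE_BITS = 0x7F7F                        # largest finite positive Bfloat16


def schraudolph_exp_bf16(x: float) -> float:
    """Approximate exp(x) by building a Bfloat16 bit pattern directly.

    Uncorrected Schraudolph form: the result is a piecewise-linear
    interpolation of 2^(x / ln 2) between powers of two.
    NaN inputs are not handled in this sketch.
    """
    t = x * SCALE + OFFSET
    # Clamp so underflow yields 0.0 and overflow saturates at the max finite value.
    t = min(max(t, 0.0), float(MAX_FINITE_BITS))
    bits16 = int(t)  # t >= 0 here, so truncation equals floor
    # A Bfloat16 value is the upper 16 bits of an IEEE-754 binary32 value.
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


def softmax_approx(xs):
    """Numerically stable Softmax using the approximate exponential."""
    m = max(xs)                                   # subtract the max so inputs are <= 0
    e = [schraudolph_exp_bf16(x - m) for x in xs]
    s = sum(e)                                    # >= 1.0, since exp(0) maps to exactly 1.0
    return [v / s for v in e]


if __name__ == "__main__":
    scores = [0.5, 1.5, -2.0, 3.0]
    print(softmax_approx(scores))
```

Without a correction term, this form overestimates the true exponential by at most roughly 6% and loses a little more to the 7-bit mantissa truncation. Part of that error cancels in the Softmax normalization, because numerator and denominator are biased in the same direction. A dedicated hardware block would replace the multiply, add, and reinterpretation steps with a short fixed-function datapath.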
