VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers

Run Wang, Gamze Islamoglu, Andrea Belano, Viviane Potocnik, Francesco Conti, Angelo Garofalo, Luca Benini

TL;DR

This paper targets the Softmax bottleneck in Transformer attention by introducing a low-cost BF16 exponential unit based on a Schraudolph-inspired approximation and integrating it as a lightweight EXP ISA extension into the RISC-V Snitch cluster. The hardware/software co-design accelerates Softmax and related kernels, notably FlashAttention-2, achieving substantial latency and energy improvements while preserving accuracy and avoiding retraining. Key contributions include the EXP custom arithmetic block, the ExpOpGroup with FEXP/VFEXP instructions, and optimized Softmax and FlashAttention-2 kernels, yielding up to 162.7× speedups and up to 5.8× end-to-end latency reductions on multi-cluster Transformers. The approach demonstrates scalable, end-to-end Transformer inference on resource-constrained hardware with modest area and power overhead, highlighting the practicality of RISC-V-based AI accelerators.

Abstract

While Transformers are dominated by Floating-Point (FP) Matrix-Multiplications, their aggressive acceleration through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions like Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method, and we integrate it into the Floating-Point Unit (FPU) of the RISC-V cores of a compute cluster, through custom Instruction Set Architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to leverage the extension, we execute Softmax with 162.7× less latency and 74.3× less energy compared to the baseline cluster, achieving an 8.2× performance improvement and 4.1× higher energy efficiency for the FlashAttention-2 kernel in GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3 and ViT, achieving up to 5.8× and 3.6× reduction in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss.
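
As a rough illustration of the idea behind the exponential unit, the sketch below implements the classic Schraudolph trick for a BF16 bit layout in Python and uses it inside a numerically stable Softmax. It is not the paper's algorithm: the paper's refinements (for example the $P(x)$ stage shown in Figure 3) and all hardware details are omitted, and the function names (schraudolph_exp_bf16, softmax_approx) and the clamping behaviour are assumptions made for this example.

```python
import math
import struct

BF16_MANTISSA_BITS = 7   # BF16: 1 sign bit, 8 exponent bits, 7 mantissa bits
BF16_EXP_BIAS = 127      # same exponent bias as IEEE-754 binary32


def bf16_bits_to_float(bits: int) -> float:
    """Decode a 16-bit BF16 pattern by placing it in the upper half of a float32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def schraudolph_exp_bf16(x: float) -> float:
    """Classic Schraudolph approximation of exp(x), written for the BF16 layout.

    exp(x) = 2**(x / ln 2). Writing x / ln 2 (scaled by 2**7 and offset by the
    exponent bias) directly into the exponent and mantissa fields makes the
    integer part land in the exponent, while the fractional part f is
    reinterpreted as a mantissa, i.e. 2**f is approximated by 1 + f.
    """
    scale = (1 << BF16_MANTISSA_BITS) / math.log(2.0)
    offset = BF16_EXP_BIAS << BF16_MANTISSA_BITS
    bits = math.floor(x * scale) + offset
    if bits <= 0:          # assumed behaviour: flush underflow to zero
        return 0.0
    if bits >= 0x7F80:     # assumed behaviour: saturate overflow to +inf
        return math.inf
    return bf16_bits_to_float(bits)


def softmax_approx(xs):
    """Softmax with max subtraction, using the approximate exponential."""
    m = max(xs)
    exps = [schraudolph_exp_bf16(x - m) for x in xs]
    total = sum(exps)      # >= 1.0 because the maximum element maps to exp(0) = 1
    return [e / total for e in exps]


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 0.5, 3.0):
        print(x, schraudolph_exp_bf16(x), math.exp(x))
    print(softmax_approx([1.0, 2.0, 3.0]))
```

Without any correction step, the linear approximation of 2**f can overestimate the true exponential by roughly 6%, which is the kind of error a refined variant has to reduce. Softmax normalisation hides part of that error because numerator and denominator are approximated in the same way.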

Paper Structure

This paper contains 19 sections, 3 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Runtime breakdown for GPT-3 on a RISC-V multi-cluster platform [potocnik_optimizing_2024]. For each sequence length, the left bar shows unoptimized results, while the right bar reflects optimized results.
  • Figure 2: Architecture of the RISC-V compute cluster with the FREP and SSR extensions [zaruba_snitch_2021].
  • Figure 3: Block diagram of (a) the extended FPU, (b) the ExpOpGroup, (c) the ExpUnit, (d) the $exps(x)$ stage, and (e) the $P(x)$ stage.
  • Figure 4: Code comparison of Baseline and Optimized Softmax implementations. Baseline Softmax uses a piecewise polynomial approximation with software LUTs for the exponential (EXP) function, explicitly handling overflow to infinity and subnormals. The notation frep n_frep, n_instr represents a loop executing the following n_instr instructions for n_frep iterations. All v instructions in the code are packed-SIMD operations.
  • Figure 5: Area breakdown of the Snitch cluster. BL: Baseline, EXP: Extended FPU with the EXP block.
  • ...and 3 more figures
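
The frep notation described in the Figure 4 caption can be read as a hardware loop over the instructions that follow it. The sketch below models only that reading in Python; run_frep, the toy instructions, and the dictionary used as machine state are hypothetical names chosen for this example and say nothing about how the Snitch core encodes or executes the instruction.

```python
from typing import Callable, Dict, List

State = Dict[str, float]
Instr = Callable[[State], None]


def run_frep(n_frep: int, n_instr: int, following: List[Instr], state: State) -> State:
    """Model `frep n_frep, n_instr`.

    The next n_instr instructions form the loop body and are executed for
    n_frep iterations; any instructions after the body then run once.
    """
    body, rest = following[:n_instr], following[n_instr:]
    for _ in range(n_frep):
        for instr in body:
            instr(state)
    for instr in rest:
        instr(state)
    return state


# Toy "instructions" acting on a single accumulator (illustrative only).
def add_one(s: State) -> None:
    s["acc"] += 1.0


def double(s: State) -> None:
    s["acc"] *= 2.0


def negate(s: State) -> None:
    s["acc"] = -s["acc"]


# Equivalent to: frep 3, 2 ; add_one ; double ; negate
final = run_frep(3, 2, [add_one, double, negate], {"acc": 0.0})
print(final["acc"])  # -14.0: ((0+1)*2 -> (2+1)*2 -> (6+1)*2) = 14, then negated
```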