Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors

Vasileios Titopoulos, Kosmas Alexandridis, Giorgos Dimitrakopoulos

TL;DR

This work tackles the high computational cost of attention by presenting a fully vectorized FlashAttention kernel for RISC-V vector processors. It introduces a low-cost exponential approximation and a tiling strategy to maximize data locality, enabling scalable vectorized execution without custom ISA extensions. Experimental results on gem5-based RISC-V hardware show significant speedups over scalar baselines while preserving accuracy on LLM-style tasks. The approach broadens accessible, energy-efficient attention acceleration to resource-constrained hardware and supports open-source exploration of vectorized transformer workloads.

Abstract

Attention is a core operation in numerous machine learning and artificial intelligence models. This work focuses on accelerating the attention kernel with the FlashAttention algorithm on vector processors, particularly those based on the RISC-V instruction set architecture (ISA). It represents the first effort to vectorize FlashAttention, minimizing scalar code and simplifying the evaluation of the exponentials required by the softmax used in attention. By using a low-cost approximation for exponentials in floating-point arithmetic, we reduce the cost of computing the exponential function without extending the baseline vector ISA with new custom instructions. In addition, appropriate tiling strategies are explored to improve memory locality. Experimental results highlight the scalability of our approach, demonstrating significant performance gains for the vectorized implementations when processing attention layers in practical applications.
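To make the idea of a low-cost floating-point exponential concrete, the sketch below shows one well-known technique of this kind: a Schraudolph-style approximation that writes a scaled input directly into the exponent and mantissa bits of an IEEE-754 float32. This is an illustrative assumption, not necessarily the approximation used in the paper; the function name `fast_exp` and the clamping range are hypothetical choices made for the example.

```python
import math
import struct

LOG2E = 1.0 / math.log(2.0)   # e**x == 2**(x * LOG2E)
MANTISSA_BITS = 23            # float32 mantissa width
BIAS = 127                    # float32 exponent bias


def fast_exp(x: float) -> float:
    """Approximate e**x with one multiply, one add, and a bit reinterpretation.

    Assumed technique (Schraudolph-style), shown for illustration only.
    The integer part of y lands in the exponent field and the fractional
    part in the mantissa, giving 2**floor(y) * (1 + frac(y)).
    """
    y = x * LOG2E
    # Clamp so the biased exponent stays in the normal float32 range [1, 254].
    y = max(-126.0, min(127.0, y))
    bits = int((y + BIAS) * (1 << MANTISSA_BITS))
    # Reinterpret the 32-bit integer pattern as a float32 value.
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.5):
        print(x, fast_exp(x), math.exp(x))
```

Without a correction term, this linear interpolation between powers of two overestimates the true value by at most roughly 6%. In a vector kernel the same steps map to a multiply-add followed by a float-to-integer conversion, which is why approaches of this kind need no dedicated exponential instruction.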

Paper Structure

This paper contains 18 sections, 10 equations, 4 figures, 3 tables, 6 algorithms.

Figures (4)

  • Figure 1: The diagram above illustrates how Attention is computed in blocks. The blue-colored blocks represent the tiled implementation of the Attention mechanism. The black-colored blocks correspond to the $k$-th row of the attention output. $N$ denotes the sequence length, $d$ denotes the head dimension, and $B_r$ and $B_c$ are the block sizes that can be controlled.
  • Figure 2: The speedup achieved by vectorizing Alg. \ref{alg:fa-d} for $B_r = 1$ and vector length 32, compared to the scalar implementation with exactly the same sequence length and head dimension.
  • Figure 3: The speedup achieved by Alg. \ref{alg:fa-d} when using $B_r > 1$, compared to the implementation with $B_r = 1$, across varying sequence lengths and the head dimensions of the Qwen-1.5B and Gemma2-2B LLMs, with a fixed vector length of 32.
  • Figure 4: The speedup achieved by Alg. \ref{alg:fa-d} for three vector length configurations (32, 64, and 128), with $B_r$ set to 32. All results are normalized to the configuration with a vector length of 32.
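
The blocked computation described in the Figure 1 caption can be sketched in plain Python as follows. This is a minimal scalar illustration of FlashAttention-style tiling with an online softmax, not the paper's vectorized kernel: the function names are hypothetical, the $1/\sqrt{d}$ scaling is omitted for brevity, and V is assumed to have the same width $d$ as Q and K. Each group of $B_r$ query rows visits the keys and values in chunks of $B_c$, keeping a running maximum and running sum per row so the full $N \times N$ score matrix is never stored.

```python
import math


def attention_reference(Q, K, V):
    """Plain softmax(Q K^T) V, materializing one full score row at a time."""
    out = []
    for q in Q:
        s = [sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(s)
        p = [math.exp(x - m) for x in s]
        total = sum(p)
        out.append([sum(p[j] * V[j][c] for j in range(len(V))) / total
                    for c in range(len(V[0]))])
    return out


def attention_tiled(Q, K, V, Br, Bc):
    """Tiled attention with online softmax (illustrative sketch)."""
    N, d = len(Q), len(Q[0])
    out = [[0.0] * d for _ in range(N)]
    for i0 in range(0, N, Br):                  # block of Br query rows
        rows = range(i0, min(i0 + Br, N))
        m = {i: -math.inf for i in rows}        # running row maximum
        l = {i: 0.0 for i in rows}              # running softmax denominator
        for j0 in range(0, N, Bc):              # block of Bc keys/values
            cols = range(j0, min(j0 + Bc, N))
            for i in rows:
                s = [sum(a * b for a, b in zip(Q[i], K[j])) for j in cols]
                m_new = max(m[i], max(s))
                # Rescale earlier partial results to the new maximum;
                # exp(-inf) is 0.0, so the first block starts from zero.
                scale = math.exp(m[i] - m_new)
                p = [math.exp(x - m_new) for x in s]
                l[i] = l[i] * scale + sum(p)
                for c in range(d):
                    out[i][c] = out[i][c] * scale + sum(
                        pj * V[j][c] for pj, j in zip(p, cols))
                m[i] = m_new
        for i in rows:                          # final normalization
            out[i] = [x / l[i] for x in out[i]]
    return out
```

The tiled result matches the reference up to floating-point rounding for any positive `Br` and `Bc`; the block sizes only change the order in which data is touched, which is what makes them tunable for memory locality.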