Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors
Vasileios Titopoulos, Kosmas Alexandridis, Giorgos Dimitrakopoulos
TL;DR
This work tackles the high computational cost of attention by presenting a fully vectorized FlashAttention kernel for RISC-V vector processors. It introduces a low-cost approximation of the exponential function and a tiling strategy that maximizes data locality, enabling scalable vectorized execution without custom ISA extensions. Experimental results on gem5-based RISC-V hardware show significant speedups over scalar baselines while preserving accuracy on LLM-style tasks. The approach makes energy-efficient attention acceleration accessible on resource-constrained hardware and supports open-source exploration of vectorized transformer workloads.
Abstract
Attention is a core operation in numerous machine learning and artificial intelligence models. This work focuses on accelerating the attention kernel with the FlashAttention algorithm on vector processors, particularly those based on the RISC-V instruction set architecture (ISA). It represents the first effort to vectorize FlashAttention, minimizing scalar code and reducing the cost of evaluating the exponentials required by the softmax used in attention. By using a low-cost approximation of exponentials in floating-point arithmetic, we reduce the cost of computing the exponential function without extending the baseline vector ISA with new custom instructions. We also explore tiling strategies that improve memory locality. Experimental results highlight the scalability of our approach, demonstrating significant performance gains for the vectorized implementations when processing attention layers in practical applications.
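To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' kernel. It assumes a Schraudolph-style exponential, which builds the IEEE-754 single-precision bit pattern of e^x with one multiply-add; the abstract does not say which approximation the paper actually uses. It also assumes a FlashAttention-style loop that visits keys and values tile by tile with a running maximum and denominator. The names approx_exp and attention_row, the constants, and the tile size are hypothetical choices for illustration.

```python
import math
import struct

# Schraudolph-style constants for IEEE-754 single precision (an assumption;
# the abstract does not state which approximation the authors use).
_A = (1 << 23) / math.log(2.0)   # maps x onto the float32 exponent field
_B = 127 * (1 << 23) - 486411    # exponent bias minus an error-balancing shift


def approx_exp(x: float) -> float:
    """Approximate e**x by assembling a float32 bit pattern with one multiply-add."""
    x = min(max(x, -87.0), 88.0)  # keep the pattern a positive, normal float32
    bits = int(_A * x + _B)
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def attention_row(q, keys, values, tile=4, exp=math.exp):
    """Attention output for one query, visiting keys and values tile by tile."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results only when the running maximum changed.
        if new_max == running_max:
            corr = 1.0
        elif running_max == -math.inf:
            corr = 0.0  # first tile: nothing accumulated yet
        else:
            corr = exp(running_max - new_max)
        probs = [exp(s - new_max) for s in scores]
        denom = denom * corr + sum(probs)
        acc = [a * corr + sum(p * v[c] for p, v in zip(probs, v_tile))
               for c, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]
```

Calling attention_row(q, keys, values, exp=approx_exp) swaps in the approximate exponential. With these constants the relative error of approx_exp stays within a few percent, and because softmax normalizes by the sum of the same approximated terms, part of that error cancels in the final weights. In a vector implementation, the per-tile list comprehensions would correspond to vector multiply-accumulate and reduction operations, and approx_exp would reduce to a vector multiply-add followed by a float-to-integer conversion.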
