SystolicAttention: Fusing FlashAttention within a Single Systolic Array

Jiawei Lin; Yuanlong Li; Guokai Chen; Thomas Bourgeat

TL;DR

This work addresses the low utilization that FlashAttention achieves on modern systolic-array accelerators. It proposes FSA, an enhanced systolic array that executes the entire FlashAttention pipeline without external vector units, and SystolicAttention, a kernel for FSA that overlaps FlashAttention operations at fine, element-wise granularity while preserving the original floating-point operation order. The synthesizable RTL implementation reaches 1.5 GHz in a 16 nm technology with about 12% area overhead relative to a standard weight-stationary systolic array. Evaluation shows up to 1.77× and 4.83× higher attention FLOPs/s utilization than AWS Neuron-v2 and Google TPUv5e, respectively, and the piecewise-linear (PWL) approximation of exp2 used by the design introduces negligible accuracy loss. The work also provides a Python kernel programming model and an open-source stack for experimenting with and deploying transformer workloads on systolic accelerators.

Abstract

Transformer models rely heavily on the scaled dot-product attention (SDPA) operation, typically implemented as FlashAttention. Characterized by its frequent interleaving of matrix multiplications and softmax operations, FlashAttention fails to fully utilize the compute resources of modern systolic-array-based accelerators designed for consecutive and large matrix multiplications. To fully unleash the performance potential of systolic arrays for FlashAttention, we propose FSA, an enhanced systolic array architecture that runs the entire FlashAttention on the array without external vector units. Combined with SystolicAttention, an optimized kernel for FSA that achieves fine-grained and element-wise overlapping of FlashAttention operations, FSA maximizes array utilization while preserving the original floating-point operation order of FlashAttention. We implement FSA in synthesizable RTL and evaluate its performance against state-of-the-art systolic-array-based accelerators. Our results show that FSA achieves 1.77x and 4.83x higher attention FLOPs/s utilization compared to AWS Neuron-v2 and Google TPUv5e, respectively. We synthesize FSA in a 16 nm technology at 1.5 GHz, and results indicate only a 12% area overhead compared to a standard weight-stationary systolic array.
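The interleaving of matrix multiplications and softmax steps that the abstract refers to can be illustrated with a minimal NumPy sketch of tiled attention with an online softmax. This is a software model of the general FlashAttention recurrence, not the FSA hardware or the SystolicAttention kernel. The function names, the block size, and the use of an exact `np.exp2` (rather than the paper's PWL approximation) are assumptions made for illustration.

```python
import numpy as np

def reference_attention(q, k, v):
    """Plain scaled dot-product attention, used only as a correctness check."""
    s = (q @ k.T) / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=4):
    """Tiled attention with an online softmax (hypothetical block size)."""
    n, d = q.shape
    # Fold log2(e) into the scale so that exp(x) can be computed as exp2(x').
    scale = np.log2(np.e) / np.sqrt(d)
    m = np.full((n, 1), -np.inf)        # running row maximum
    l = np.zeros((n, 1))                # running softmax denominator
    acc = np.zeros((n, v.shape[1]))     # running unnormalized output
    for start in range(0, k.shape[0], block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        s = (q @ kb.T) * scale          # first matmul: scores for this tile
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        alpha = np.exp2(m - m_new)      # rescale factor; 0 on the first tile
        p = np.exp2(s - m_new)          # softmax numerator for this tile
        l = alpha * l + p.sum(axis=-1, keepdims=True)
        acc = alpha * acc + p @ vb      # second matmul: weighted values
        m = m_new
    return acc / l
```

Each loop iteration alternates a matmul, a row-wise max, an exponentiation, a row sum, and a second matmul. On accelerators that pair a systolic array with separate vector units, the non-matmul steps are what leave the array idle between tiles.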
