SystolicAttention: Fusing FlashAttention within a Single Systolic Array

Jiawei Lin, Yuanlong Li, Guokai Chen, Thomas Bourgeat

TL;DR

This work addresses the low utilization that FlashAttention achieves on modern systolic-array accelerators by proposing FSA, an enhanced systolic array that executes the entire attention pipeline without external vector units. Building on FSA, the authors introduce SystolicAttention, a kernel that overlaps FlashAttention operations at fine, element-wise granularity while preserving the original floating-point operation order. The RTL implementation reaches 1.5 GHz in a 16 nm technology with only ~12% area overhead relative to a standard weight-stationary systolic array, and evaluation shows up to 1.77× and 4.83× higher attention FLOPs/s utilization than AWS Neuron-v2 and Google TPUv5e, respectively. The piecewise-linear (PWL) approximation of exp2 used on the array causes negligible accuracy loss. The work also provides a Python kernel programming model and an open-source stack, enabling broader experimentation and deployment on systolic accelerators for transformer workloads.

Abstract

Transformer models rely heavily on the scaled dot-product attention (SDPA) operation, typically implemented as FlashAttention. Characterized by its frequent interleaving of matrix multiplications and softmax operations, FlashAttention fails to fully utilize the compute resources of modern systolic-array-based accelerators designed for consecutive and large matrix multiplications. To fully unleash the performance potential of systolic arrays for FlashAttention, we propose FSA, an enhanced systolic array architecture that runs the entire FlashAttention on the array without external vector units. Combined with SystolicAttention, an optimized kernel for FSA that achieves fine-grained and element-wise overlapping of FlashAttention operations, FSA maximizes array utilization while preserving the original floating-point operation order of FlashAttention. We implement FSA in synthesizable RTL and evaluate its performance against state-of-the-art systolic-array-based accelerators. Our results show that FSA achieves 1.77x and 4.83x higher attention FLOPs/s utilization compared to AWS Neuron-v2 and Google TPUv5e, respectively. We synthesize FSA in a 16 nm technology at 1.5 GHz, and results indicate only a 12% area overhead compared to a standard weight-stationary systolic array.
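
To make the "frequent interleaving of matrix multiplications and softmax operations" concrete, the following is a minimal NumPy sketch of a FlashAttention-style forward pass. It is not the authors' kernel and models no hardware; the function names, the tile size `block`, and the choice to fold log2(e) into the scale so that the exponential becomes an exp2 are assumptions made for illustration.

```python
import numpy as np

def reference_attention(q, k, v):
    """Plain softmax(Q K^T / sqrt(d)) V, used only as a correctness check."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v

def tiled_attention(q, k, v, block=16):
    """FlashAttention-style forward pass over key/value tiles (illustrative)."""
    n, d = q.shape
    # Assumption: fold log2(e) into the scale so exp(x) becomes exp2(x * log2(e)).
    scale = np.log2(np.e) / np.sqrt(d)
    row_max = np.full((n, 1), -np.inf)   # running row-wise maximum
    row_sum = np.zeros((n, 1))           # running softmax denominator
    out = np.zeros((n, v.shape[1]))      # running unnormalized output
    for start in range(0, k.shape[0], block):
        k_tile = k[start:start + block]
        v_tile = v[start:start + block]
        s = (q @ k_tile.T) * scale                                    # matmul 1
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))  # rowmax
        p = np.exp2(s - new_max)                                      # element-wise exp2
        correction = np.exp2(row_max - new_max)                       # rescale old state
        row_sum = correction * row_sum + p.sum(axis=-1, keepdims=True)
        out = correction * out + p @ v_tile                           # matmul 2
        row_max = new_max
    return out / row_sum
```

Each loop iteration alternates a matrix multiplication, a row-wise reduction, an element-wise exponential, and a second matrix multiplication. On an accelerator that runs only the two matrix multiplications on the systolic array, the steps in between are the ones that leave the array idle.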

Paper Structure

This paper contains 28 sections, 7 equations, 17 figures, 4 tables, 1 algorithm.

Figures (17)

  • Figure 1: Percentage of active time of various components in AWS NeuronCore-v2 when running FlashAttention.
  • Figure 2: Computing $C \mathrel{+}= AB$ on a weight-stationary systolic array.
  • Figure 3: FSA's architectural modifications (highlighted in red) to the standard systolic array.
  • Figure 4: On-the-fly rowmax generation.
  • Figure 5: Calculating exp2 using PEs extended with PWL capability.
  • ...and 12 more figures
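
As a rough illustration of the piecewise-linear (PWL) exp2 idea named in the Figure 5 caption, the sketch below splits the input into an integer and a fractional part, interpolates 2^f linearly within equal-width segments of [0, 1), and applies the integer part as an exponent shift. The segment count, the use of chord interpolation, and the function name `pwl_exp2` are assumptions; the paper's actual segmentation and per-PE datapath are not described in this excerpt.

```python
import numpy as np

def pwl_exp2(x, segments=8):
    """Approximate 2**x with a piecewise-linear fit on the fractional part.

    Assumes finite inputs. `segments` is a hypothetical parameter, not a
    value taken from the paper.
    """
    x = np.asarray(x, dtype=np.float64)
    n = np.floor(x)                      # integer part, applied as a shift
    f = x - n                            # fractional part in [0, 1)
    idx = np.minimum((f * segments).astype(int), segments - 1)
    lo = idx / segments                  # left edge of the chosen segment
    hi = (idx + 1) / segments            # right edge of the chosen segment
    y_lo = np.exp2(lo)                   # in hardware these would be table entries
    y_hi = np.exp2(hi)
    slope = (y_hi - y_lo) * segments     # chord slope across the segment
    frac_val = y_lo + slope * (f - lo)   # linear interpolation of 2**f
    return np.ldexp(frac_val, n.astype(int))  # multiply by 2**n
```

Because 2^f is convex, the chord lies on or above the true curve, so this variant never underestimates; with 8 segments the relative error stays below about 0.1%. Whether that matches the accuracy of the design in the paper is not something this excerpt establishes.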