Attention in SRAM on Tenstorrent Grayskull
Moritz Thüning
TL;DR
This work accelerates Transformer self-attention on the SRAM-based Tenstorrent Grayskull accelerator. It introduces a dedicated Softmax kernel and a fused kernel that combines matrix multiplication, attention score scaling, and Softmax; both execute primarily in on-chip SRAM to minimize DRAM traffic. The dedicated Softmax kernel is up to ~10× faster than a CPU baseline, and the Softmax inside the fused kernel is approximately 1.8× faster than the dedicated Softmax kernel. All implementations retain Θ(n^2) time and memory complexity in the sequence length n. The paper describes the hardware and software stack (TT-Buda/TT-Metalium) and a tile-based dataflow across the Tensix core grid, and reports that fusing the operations also reduces kernel-dispatch overhead and DRAM accesses. On cost and capacity, the Grayskull e150 is approximately 30× cheaper than an Nvidia H100 PCIe and offers approximately 1.5× more SRAM. Future directions include integrating the remaining matmuls into the fused kernel and scaling across cards.
Abstract
When implementations of the Transformer's self-attention layer utilize SRAM instead of DRAM, they can achieve significant speedups. The Tenstorrent Grayskull architecture provides a large SRAM, distributed across a grid of cores. This work presents a fused kernel for Grayskull that exclusively utilizes its large SRAM by combining the matrix multiplication, attention score scaling, and Softmax operations. Additionally, a dedicated Softmax kernel that utilizes the SRAM and a CPU implementation that serves as a baseline are presented. The Softmax operation consumes most of the runtime in the computation of attention weights from queries and keys on Grayskull. The dedicated Softmax kernel is up to $10 \times$ faster than the CPU implementation, and the Softmax implementation inside the fused kernel is approximately $1.8 \times$ faster than the dedicated Softmax kernel. The time and memory complexity of all implementations is quadratic in sequence length. Currently, the Grayskull e150 is approximately $30 \times$ cheaper for the general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers approximately $1.5 \times$ more SRAM.
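For orientation, the operation that the fused kernel combines (matrix multiplication of queries and keys, attention score scaling, and Softmax) can be written as a short plain-Python reference. This is only a sketch of the mathematics, not the Grayskull kernel and not the CPU baseline from this work. The function names are hypothetical, and the $1/\sqrt{d}$ scale factor assumes the standard scaled dot-product form of attention.

```python
import math

def softmax(row):
    """Numerically stable Softmax over one row of attention scores."""
    m = max(row)  # subtract the row maximum so exp() cannot overflow
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return the n x n attention weights for n queries and n keys.

    Assumption: scores are scaled by 1/sqrt(d), where d is the
    dimension of each query/key vector (standard scaled dot-product
    attention). Each output row is produced in one pass: dot products,
    scaling, then Softmax.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in queries:
        # One row of the scaled score matrix: n dot products of length d.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    return weights

# Tiny illustrative input: n = 3 tokens, d = 2 features (made-up values).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(queries, keys)
```

The result has one row per query and one column per key, so its size grows with the square of the sequence length, which is the quadratic time and memory cost stated above.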
