Attention in SRAM on Tenstorrent Grayskull

Moritz Thüning

TL;DR

This work accelerates Transformer self-attention on the SRAM-based Tenstorrent Grayskull accelerator. It introduces a dedicated Softmax kernel and a fused kernel (matrix multiplication, scaling, and Softmax) that run primarily in on-chip SRAM to minimize DRAM traffic. The dedicated Softmax kernel is up to ~10× faster than a CPU baseline, and the Softmax inside the fused kernel is approximately 1.8× faster than the dedicated kernel; all implementations retain Θ(n^2) time and memory complexity in the sequence length n. The paper describes the hardware and software stack (TT-Buda and TT-Metalium) and a tile-based dataflow across the Tensix core grid, and notes that fusing the operations also reduces kernel-dispatch overhead and DRAM accesses. It also highlights Grayskull's cost and SRAM capacity (the Grayskull e150 is approximately 30× cheaper than an Nvidia H100 PCIe and offers approximately 1.5× more SRAM) and points to future work on integrating the remaining matrix multiplications and scaling across cards.

Abstract

Implementations of the Transformer's self-attention layer can achieve significant speedups when they use SRAM instead of DRAM. The Tenstorrent Grayskull architecture provides a large SRAM, distributed across a grid of cores. This work presents a fused kernel for Grayskull that exclusively uses this SRAM by combining matrix multiplication, attention score scaling, and Softmax operations. Additionally, a dedicated Softmax kernel that uses the SRAM and a CPU implementation that serves as a baseline are presented. On Grayskull, the Softmax operation consumes most of the runtime when computing attention weights from queries and keys. The dedicated Softmax kernel is up to $10 \times$ faster than the CPU implementation, and the Softmax implementation inside the fused kernel is approximately $1.8 \times$ faster than the dedicated Softmax kernel. The time and memory complexity of all implementations is quadratic in the sequence length. Currently, the Grayskull e150 is approximately $30 \times$ cheaper for the general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers approximately $1.5 \times$ more SRAM.
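The three operations that the fused kernel combines (matrix multiplication of queries and keys, scaling, and a row-wise Softmax) can be sketched on the CPU with NumPy. This is a minimal reference sketch, not the Grayskull kernel: the function name `attention_weights`, the scaling by the square root of the head dimension, and the max-subtraction for numerical stability are assumptions for illustration and may differ from what the paper's kernels do.

```python
import numpy as np

def attention_weights(q, k):
    """Row-wise softmax of the scaled scores q @ k.T / sqrt(d).

    q, k: arrays of shape (n, d), where n is the sequence length.
    Returns an (n, n) matrix, so time and memory grow quadratically in n.
    """
    d = q.shape[-1]
    # Matrix multiplication followed by scaling (assumed 1/sqrt(d) factor).
    scores = (q @ k.T) / np.sqrt(d)
    # Subtract the row maximum before exponentiating (assumed stabilization).
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    # Normalize each row so it sums to one.
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 8, 4  # hypothetical toy sizes
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
w = attention_weights(q, k)
```

The intermediate `scores` matrix has shape `(n, n)`, which is where the quadratic memory demand in sequence length comes from.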

Paper Structure

This paper contains 24 sections, 4 equations, 6 figures, and 1 table.

Introduction
Tenstorrent Grayskull e150
Tensix core
Tenstorrent Software
TT-Buda: high-level, top-down
TT-Metalium: low-level, bottom-up
Multi-Head Self-Attention
Matrix Multiplication on Grayskull
Softmax
Softmax on CPU
Maximum sequence length
Time and memory complexity
Effect of caching
Softmax on Grayskull
Maximum sequence length
...and 9 more sections

Figures (6)

Figure 1: Topology of the Network-on-Chip (NoC). Nodes represent Tensix cores and the edges represent bi-directional connections between them. The actual Tensix core grid of Grayskull is $10 \times 12$. It is a torus topology, since opposite ends are connected.
Figure 2: Inside a Tensix core. R1 represents the first RISC-V core. The kernels run on the RISC-V cores, which control the other components. The routers are connected to the NoC and exchange data with the engines via buffers in SRAM.
Figure 3: Effect of caching the exponentials. Note that the two y-axes have different scales. All CPU experiments were conducted on a single core of an Intel i5-6500 processor with 8 GB DDR4 memory, running Ubuntu 20.04. It has 32 KB L1 cache per core, 1 MB shared L2 and 6 MB shared L3 cache.
Figure 4: Runtime distribution of the Softmax kernel with $8192 \times 8192$ input matrix. Measured on the compute core (3rd of 5 RISC-V cores) of the first Tensix core from three different rows in the core grid. For example, load1 means loading the first row of tiles. The Tensix core in the first row processes one additional row of tiles.
Figure 5: Runtime distribution of the fused kernel across varying input dimensions. Measured on the compute core (3rd of 5 RISC-V cores) of the fastest Tensix core.
...and 1 more figures
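The torus topology described in the caption of Figure 1 can be illustrated with a short sketch that computes the neighbors of a core with wrap-around at the grid edges. This is only an illustration under assumptions: the helper `torus_neighbors` is hypothetical, the $10 \times 12$ grid is taken to mean 10 rows and 12 columns, and each core is assumed to connect to four neighbors (up, down, left, right).

```python
ROWS, COLS = 10, 12  # assumed orientation of the 10 x 12 Tensix grid

def torus_neighbors(row, col, rows=ROWS, cols=COLS):
    """Return the (row, col) coordinates of the four neighbors on a 2D torus.

    The modulo wraps indices around, so cores on opposite edges are connected.
    """
    return [
        ((row - 1) % rows, col),  # up (wraps from first row to last row)
        ((row + 1) % rows, col),  # down
        (row, (col - 1) % cols),  # left (wraps from first column to last column)
        (row, (col + 1) % cols),  # right
    ]

corner = torus_neighbors(0, 0)
```

For the corner core at `(0, 0)`, two of the four neighbors lie on the opposite edges of the grid, which is what distinguishes a torus from a plain mesh.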
