Numerical Kernels on a Spatial Accelerator: A Study of Tenstorrent Wormhole

Maya Taylor; Carl Pearson; Luc Berger-Vergiat; Giovanni Long; Jan Ciesko

Abstract

As AI accelerators gain prominence, their potential for traditional scientific computing workloads remains unclear. This paper explores Tenstorrent's Wormhole architecture, a spatial computing platform designed for neural network acceleration, by implementing three numerical kernels and composing them into a conjugate gradient solver. We present architecture-specific optimizations for sparse numerical algorithms, evaluate their performance against Nvidia GPUs, and expose both challenges and opportunities in porting numerical methods to spatial architectures. Our results demonstrate that AI accelerators merit consideration for workloads traditionally dominated by CPUs and GPUs, and that more work should be invested in understanding the capabilities of these architectures and in making them accessible to the scientific computing community.

Paper Structure (21 sections, 2 equations, 13 figures, 3 tables, 1 algorithm)

Introduction
Background and Related Work
Tenstorrent Wormhole
Tiles
Circular Buffers
Compute Units
Performance Tracing
Basic Arithmetic Operations
Global Reduction
Partial Result Granularity
Network Routing Patterns
Stencil Collectives
Data Distribution
Tile Shift for the Stencil Computation
Tile Transpose for Contiguous Boundary Exchange
...and 6 more sections

Figures (13)

Figure 1: Block diagram of a Tensix core. Each core has 1.5 MB of SRAM, five baby RISC-V cores, a vector compute unit (SFPU), and a matrix compute unit (FPU). Two of the baby RISC-V cores are connected to the NoC and manage data movement to and from DRAM and other cores. The other three baby RISC-V cores manage data movement between SRAM and the compute units.
Figure 2: Layout of a 32x32 tile logically (left) and physically (right). Colors and indices are used to indicate the interleaving of subtiles in physical memory.
Figure 3: Roofline model of the Wormhole Tensix architecture for 16-bit element-wise addition, with implementation variants for the FPU and SFPU. Both data points represent runs with 256 tiles per Tensix core, or 262,144 elements.
Figure 4: Outline of the local reduction that computes partial result tiles for the dot product of two vectors $q$ and $p$. Each core reduces its two input vectors to a single tile, which is then contributed to the global reduction either before or after it is reduced to a single scalar value.
Figure 5: Weak scaling of the dot product, using two different reduction methods with the SFPU FP32 implementation and 64 tiles per core. Method 1 reduces to a scalar on each core, and method 2 reduces only at the final core.
...and 8 more figures
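
The two reduction methods named in the captions of Figures 4 and 5 can be illustrated with a minimal sketch in plain Python. This is not the paper's implementation: lists stand in for Tensix cores and tiles, the function names (`local_partial_tile`, `dot_method1`, `dot_method2`, `make_inputs`) are hypothetical, and the NoC, circular buffers, and SFPU/FPU are not modeled. It only shows that reducing to a scalar on every core (Method 1) and reducing to a scalar only at the final core (Method 2) compute the same dot product.

```python
# Illustrative model only: plain Python lists stand in for cores and tiles.
TILE_ELEMS = 32 * 32  # one 32x32 tile holds 1,024 elements


def local_partial_tile(q, p):
    """Multiply q and p element-wise and accumulate, tile by tile, into one partial tile."""
    assert len(q) == len(p) and len(q) % TILE_ELEMS == 0
    partial = [0.0] * TILE_ELEMS
    for start in range(0, len(q), TILE_ELEMS):
        for i in range(TILE_ELEMS):
            partial[i] += q[start + i] * p[start + i]
    return partial


def dot_method1(per_core_inputs):
    """Method 1: every core reduces its partial tile to a scalar before the global sum."""
    scalars = [sum(local_partial_tile(q, p)) for q, p in per_core_inputs]
    return sum(scalars)


def dot_method2(per_core_inputs):
    """Method 2: cores contribute whole tiles; only the final step reduces to a scalar."""
    total_tile = [0.0] * TILE_ELEMS
    for q, p in per_core_inputs:
        partial = local_partial_tile(q, p)
        total_tile = [a + b for a, b in zip(total_tile, partial)]
    return sum(total_tile)


def make_inputs(num_cores, tiles_per_core):
    """Build small integer-valued vectors (hypothetical data) so sums are exact in floating point."""
    n = tiles_per_core * TILE_ELEMS
    inputs = []
    for c in range(num_cores):
        q = [float((i + c) % 7) for i in range(n)]
        p = [float((i * 3 + c) % 5) for i in range(n)]
        inputs.append((q, p))
    return inputs
```

In this model the two methods differ only in what is passed between cores: Method 1 passes one scalar per core, while Method 2 passes a full 1,024-element tile and defers the tile-to-scalar reduction to the end. The same tile size also accounts for the element count in the Figure 3 caption, since 256 tiles of 1,024 elements give 262,144 elements.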
