Accelerating stencils on the Tenstorrent Grayskull RISC-V accelerator

Nick Brown, Ryan Barton

TL;DR

This paper explores the suitability of stencils on the Grayskull e150, explores best practice in structuring these codes for the accelerator and demonstrates that the e150 provides similar performance to a Xeon Platinum CPU but the e150 uses around five times less energy.

Abstract

The RISC-V Instruction Set Architecture (ISA) has enjoyed phenomenal growth in recent years; however, it has yet to gain popularity in HPC. Whilst adopting RISC-V CPU solutions in HPC might be some way off, RISC-V based PCIe accelerators offer a middle ground where vendors benefit from the flexibility of RISC-V yet fit into existing systems. In this paper we focus on the Tenstorrent Grayskull PCIe RISC-V based accelerator which, built upon Tensix cores, decouples data movement from compute. Using the Jacobi iterative method as a vehicle, we explore the suitability of stencils on the Grayskull e150. We explore best practice in structuring these codes for the accelerator and demonstrate that the e150 provides similar performance to a Xeon Platinum CPU (albeit BF16 vs FP32) but the e150 uses around five times less energy. Over four e150s we obtain around four times the CPU performance, again at around five times less energy.
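
For readers unfamiliar with the Jacobi iterative method mentioned above, the following is a minimal sketch of a stencil sweep. It assumes a 2D five-point stencil in which each interior point is replaced by the average of its four neighbours, with fixed boundary values. It runs in plain Python floating point on the host and is not the BF16 Tensix implementation described in the paper. The function names `jacobi_step` and `jacobi` are hypothetical and chosen only for illustration.

```python
def jacobi_step(grid):
    """Perform one Jacobi sweep and return a new grid.

    Assumption: five-point stencil, so each interior cell becomes the
    average of its four neighbours. Boundary cells are copied unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    new_grid = [row[:] for row in grid]  # copy, so reads use only old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new_grid[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new_grid


def jacobi(grid, iterations):
    """Apply a fixed number of Jacobi sweeps (no convergence check)."""
    for _ in range(iterations):
        grid = jacobi_step(grid)
    return grid
```

Each sweep reads only values from the previous iteration, so all interior points can be updated independently. This is the property that makes such codes candidates for tiled, parallel execution.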

Paper Structure

This paper contains 13 sections, 6 figures and 8 tables.

Figures (6)

  • Figure 1: A single Tensix core contains five RISC-V baby cores, 1MB of SRAM memory, an FPU and two routers.
  • Figure 2: Illustration of a domain surrounded by boundary conditions for stencil based computation.
  • Figure 3: Initial design, where a Tensix core retrieves data from DRAM, serves it to the compute cores which drive the FPU, and results are then written back to DRAM.
  • Figure 4: Illustration of decomposing the domain into distinct batches of size 32 by 32 BF16 elements.
  • Figure 5: Illustration of additional 256 bit wide allocation on the left and right of the domain, containing empty values apart from the boundary conditions so that writing of 32 by 32 result tiles is always aligned.
  • ...and 1 more figures