
Do We Need Tensor Cores for Stencil Computations?

Qiqi Gu, Chenpeng Wu, Heng Shi, Jianguo Yao, Haibing Guan

TL;DR

This paper presents a systematic performance analysis of stencil computations on Tensor Cores. It revisits how stencils are adapted onto Tensor Cores and derives analytical criteria for deciding whether Tensor Cores suit a given stencil workload.

Abstract

Stencil computation constitutes a cornerstone of scientific computing, serving as a critical kernel in domains ranging from fluid dynamics to weather simulation. While stencil computations are conventionally regarded as memory-bound and thus unsuitable for compute-centric Tensor Cores, recent empirical studies have demonstrated significant speedups after applying Tensor Cores, forming an apparent contradiction. This paper resolves this contradiction by conducting a systematic performance analysis of stencil computations on Tensor Cores. We begin by revisiting the adaptation of stencils onto Tensor Cores, quantifying the computational redundancy introduced by the transformations required to satisfy hardware constraints. These metrics are subsequently integrated into an enhanced performance model that explicitly accounts for the arithmetic intensity shifts driven by temporal fusion. Guided by this formulation, we derive analytical criteria to determine the suitability of Tensor Cores for varying stencil workloads. By classifying operational regions, we identify the specific "sweet spot" for effective acceleration and further demonstrate how Sparse Tensor Cores expand this profitable design space. Extensive evaluations on NVIDIA GPUs across state-of-the-art implementations, including DRStencil, EBISU, ConvStencil, and SPIDER, validate our performance model and analytical criteria. These results demonstrate the effectiveness of our approach in guiding stencil performance optimization.
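
The abstract's argument that temporal fusion shifts arithmetic intensity can be illustrated with the classic roofline bound. The sketch below is not the paper's enhanced performance model, which additionally accounts for the redundancy introduced by Tensor Core transformations. It assumes an idealized fusion in which each fused time step adds the same work per grid point while memory traffic stays at one load and one store. The function names, the device figures (10,000 GFLOP/s peak, 1,000 GB/s bandwidth), and the 9-FLOP, 16-byte cost of a double-precision 5-point stencil are all hypothetical choices made for illustration.

```python
def arithmetic_intensity(flops_per_point, bytes_per_point, fused_steps=1):
    """FLOPs per byte of memory traffic for one grid point.

    Idealized assumption: fusing `fused_steps` time steps multiplies the work
    per point while the traffic stays at one load and one store per point.
    """
    return fused_steps * flops_per_point / bytes_per_point


def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Classic roofline bound: the lower of the compute and memory ceilings."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def is_memory_bound(peak_gflops, bandwidth_gbs, intensity):
    """True when the memory ceiling lies below the compute ceiling."""
    return intensity * bandwidth_gbs < peak_gflops


# Hypothetical device and stencil figures, chosen only for illustration.
PEAK_GFLOPS = 10_000.0   # assumed compute peak
BANDWIDTH_GBS = 1_000.0  # assumed memory bandwidth
FLOPS_PER_POINT = 9      # 5 multiplies + 4 adds for a 5-point stencil
BYTES_PER_POINT = 16     # one 8-byte read + one 8-byte write

for steps in (1, 4, 32):
    ai = arithmetic_intensity(FLOPS_PER_POINT, BYTES_PER_POINT, steps)
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, ai)
    regime = "memory" if is_memory_bound(PEAK_GFLOPS, BANDWIDTH_GBS, ai) else "compute"
    print(f"fused steps={steps:2d}  intensity={ai:.4f}  bound={bound:.1f} GFLOP/s  ({regime}-bound)")
```

Under these assumed numbers the unfused stencil sits far below the ridge point of 10 FLOPs per byte, which matches the conventional memory-bound view. Deep enough fusion crosses the ridge, and only past that point can a higher compute peak raise the bound.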

Paper Structure

This paper contains 31 sections, 20 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Stencil Computation Pattern.
  • Figure 2: Performance Comparison between CUDA Core and Tensor Core Implementations.
  • Figure 3: Reduction Dimension Mismatch between Matrix Multiplication and Stencil.
  • Figure 4: Two Typical Transformation Schemes to Adapt Stencil Computation onto Tensor Cores.
  • Figure 5: Transformed Sparse Matrices in Recent Tensor Core-based Implementations.
  • ...and 11 more figures