Do We Need Tensor Cores for Stencil Computations?

Qiqi Gu; Chenpeng Wu; Heng Shi; Jianguo Yao; Haibing Guan

TL;DR

This paper conducts a systematic performance analysis of stencil computations on Tensor Cores: it revisits how stencils are adapted onto Tensor Cores and derives analytical criteria for determining whether Tensor Cores suit a given stencil workload.

Abstract

Stencil computation constitutes a cornerstone of scientific computing, serving as a critical kernel in domains ranging from fluid dynamics to weather simulation. While stencil computations are conventionally regarded as memory-bound and thus unsuitable for compute-centric Tensor Cores, recent empirical studies have demonstrated significant speedups after applying Tensor Cores, forming an apparent contradiction. This paper resolves this contradiction by conducting a systematic performance analysis of stencil computations on Tensor Cores. We begin by revisiting the adaptation of stencils onto Tensor Cores, quantifying the computational redundancy introduced by the transformations required to satisfy hardware constraints. These metrics are subsequently integrated into an enhanced performance model that explicitly accounts for the arithmetic intensity shifts driven by temporal fusion. Guided by this formulation, we derive analytical criteria to determine the suitability of Tensor Cores for varying stencil workloads. By classifying operational regions, we identify the specific "sweet spot" for effective acceleration and further demonstrate how Sparse Tensor Cores expand this profitable design space. Extensive evaluations on NVIDIA GPUs across state-of-the-art implementations, including DRStencil, EBISU, ConvStencil, and SPIDER, validate our performance model and analytical criteria. These results demonstrate the effectiveness of our approach in guiding stencil performance optimization.
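The tension the abstract describes can be illustrated with the classic roofline model, under which attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below is not the paper's enhanced performance model: it ignores the computational redundancy introduced by Tensor Core transformations and assumes idealized temporal fusion with no halo overhead. The function names, hardware figures, and per-point FLOP and byte counts are all hypothetical values chosen for illustration.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Classic roofline bound: min(peak compute, bandwidth * intensity).

    intensity is arithmetic intensity in FLOP per byte of memory traffic.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def fused_intensity(flops_per_point, bytes_per_point, fused_steps):
    """Idealized arithmetic intensity after fusing several time steps.

    Assumption: fusing t steps performs t times the work for the same
    memory traffic, with halo and redundancy overheads ignored.
    """
    return fused_steps * flops_per_point / bytes_per_point


# Hypothetical machine: 10,000 GFLOP/s peak, 1,000 GB/s memory bandwidth.
PEAK_GFLOPS = 10_000.0
BANDWIDTH_GBS = 1_000.0

# Hypothetical 2D 5-point stencil in double precision:
# 5 multiplies + 4 adds per point, one 8-byte read and one 8-byte write.
FLOPS_PER_POINT = 9
BYTES_PER_POINT = 16

for steps in (1, 4, 32):
    ai = fused_intensity(FLOPS_PER_POINT, BYTES_PER_POINT, steps)
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, ai)
    regime = "compute-bound" if bound >= PEAK_GFLOPS else "memory-bound"
    print(f"fused steps={steps:2d}  intensity={ai:6.4f} FLOP/B  "
          f"bound={bound:8.1f} GFLOP/s  ({regime})")
```

With these made-up numbers, the unfused stencil sits far below the ridge point and gains nothing from a higher compute peak, while deep enough fusion pushes it into the compute-bound region where faster arithmetic units could matter. The paper's criteria additionally weigh the redundant work that Tensor Core adaptations introduce, which this sketch leaves out.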

Paper Structure (31 sections, 20 equations, 16 figures, 4 tables)

Introduction
What Happened When Introducing Tensor Cores?
Constraints Imposed by Tensor Cores
Tensor Contraction Constraint.
Operand Size Constraint.
Adapting Stencil onto Tensor Cores
Bridging Tensor Contraction Mismatch.
Aligning with Operand Size Requirement.
Handling Stencils with Small Radius.
Performance Formulation
Roofline Model
Performance on Different Hardware
Original Stencil Problem
CUDA Core Implementation with Temporal Fusion
Tensor Core Implementation with Kernel Fusion
...and 16 more sections

Figures (16)

Figure 1: Stencil Computation Pattern.
Figure 2: Performance Comparison between CUDA Core and Tensor Core Implementations.
Figure 3: Reduction Dimension Mismatch between Matrix Multiplication and Stencil.
Figure 4: Two Typical Transformation Schemes to Adapt Stencil Computation onto Tensor Cores.
Figure 5: Transformed Sparse Matrices in Recent Tensor Core-based Implementations.
...and 11 more figures
