PETSc/TAO Developments for GPU-Based Early Exascale Systems

Richard Tran Mills; Mark Adams; Satish Balay; Jed Brown; Jacob Faibussowitsch; Toby Isaac; Matthew Knepley; Todd Munson; Hansol Suh; Stefano Zampini; Hong Zhang; Junchao Zhang

TL;DR

Some of the challenges that designers of numerical libraries face are recapped, and the many developments made are discussed, including the addition of new GPU backends, features supporting efficient on-device matrix assembly, better support for asynchronicity and GPU kernel concurrency, and new communication infrastructure.

Abstract

The Portable Extensible Toolkit for Scientific Computation (PETSc) library provides scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization via the Toolkit for Advanced Optimization (TAO). PETSc is used in dozens of scientific fields and is an important building block for many simulation codes. During the U.S. Department of Energy's Exascale Computing Project, the PETSc team has made substantial efforts to enable efficient utilization of the massive fine-grain parallelism present within exascale compute nodes and to enable performance portability across exascale architectures. We recap some of the challenges that designers of numerical libraries face in such an endeavor, and then discuss the many developments we have made, which include the addition of new GPU backends, features supporting efficient on-device matrix assembly, better support for asynchronicity and GPU kernel concurrency, and new communication infrastructure. We evaluate the performance of these developments on some pre-exascale systems as well as the early exascale systems Frontier and Aurora, using compute kernel, communication layer, solver, and mini-application benchmark studies, and then close with a few observations drawn from our experiences on the tension between portable performance and other goals of numerical libraries.

Paper Structure (24 sections, 3 equations, 13 figures, 2 tables)

This paper contains 24 sections, 3 equations, 13 figures, 2 tables.

Introduction
GPU programming challenges and PETSc's responses
On PETSc's code portability:
On PETSc's memory types:
On PETSc's GPU streams:
On PETSc's multiprecision support:
Porting PETSc applications to GPUs
Communication on GPUs
The star-forest (SF) abstraction
The mismatch between MPI and GPUs
Synchronization-free PetscSF with NVSHMEM
GPU-aware MPI message passing latency on (pre-)exascale machines
Portable matrix assembly on GPUs
Asynchronous linear solvers on GPUs
Distributed asynchronous CG with PetscSF over NVSHMEM
...and 9 more sections

Figures (13)

Figure 1: Two star-forest examples. The left example has three MPI ranks, while the right has two. Vertical dashed lines separate MPI ranks. Colored boxes are roots (leaves). Enclosed numbers are indices of the roots (leaves) in their index space. Dashed boxes represent holes in the spaces not belonging to the SF.
Figure 2: A typical data path of PetscSFReduce() with CUDA, assuming all parts except MPI work on a common device stream local to the calling process. Note the stream synchronization before MPI_Isend().
Figure 3: SF-pingpong test: MPI latency between the two closest GPUs on the four (pre-)exascale machines, and between two CPU cores within a compute node on Perlmutter. Note the strikingly better performance on Frontier with small messages compared with the other machines.
Figure 4: SF-unpack test: the test is similar to Figure 3, except that the latency includes the execution time of the unpack kernel after receiving data. Note that the performance on Frontier with small messages no longer stands out as it did in Figure 3. Also note that GPUs perform much better than CPUs in the unpack kernel with large messages, thanks to their higher memory bandwidth.
Figure 5: Timeline of CG (top) and CGAsync (bottom) on rank 2. Each ran ten iterations. The blue csr... bars are csrMV (i.e., SpMV) kernels in cuSPARSE, and the red c... bars are cudaMemcpyAsync() copying data from device to host.
...and 8 more figures
