Table of Contents
Fetching ...

PETSc/TAO Developments for GPU-Based Early Exascale Systems

Richard Tran Mills, Mark Adams, Satish Balay, Jed Brown, Jacob Faibussowitsch, Toby Isaac, Matthew Knepley, Todd Munson, Hansol Suh, Stefano Zampini, Hong Zhang, Junchao Zhang

TL;DR

Some of the challenges that designers of numerical libraries face are recap, and the many developments made, which include the addition of new GPU backends, features supporting efficient on-device matrix assembly, better support for asynchronicity and GPU kernel concurrency, and new communication infrastructure are discussed.

Abstract

The Portable Extensible Toolkit for Scientific Computation (PETSc) library provides scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization via the Toolkit for Advanced Optimization (TAO). PETSc is used in dozens of scientific fields and is an important building block for many simulation codes. During the U.S. Department of Energy's Exascale Computing Project, the PETSc team has made substantial efforts to enable efficient utilization of the massive fine-grain parallelism present within exascale compute nodes and to enable performance portability across exascale architectures. We recap some of the challenges that designers of numerical libraries face in such an endeavor, and then discuss the many developments we have made, which include the addition of new GPU backends, features supporting efficient on-device matrix assembly, better support for asynchronicity and GPU kernel concurrency, and new communication infrastructure. We evaluate the performance of these developments on some pre-exascale systems as well the early exascale systems Frontier and Aurora, using compute kernel, communication layer, solver, and mini-application benchmark studies, and then close with a few observations drawn from our experiences on the tension between portable performance and other goals of numerical libraries.

PETSc/TAO Developments for GPU-Based Early Exascale Systems

TL;DR

Some of the challenges that designers of numerical libraries face are recap, and the many developments made, which include the addition of new GPU backends, features supporting efficient on-device matrix assembly, better support for asynchronicity and GPU kernel concurrency, and new communication infrastructure are discussed.

Abstract

The Portable Extensible Toolkit for Scientific Computation (PETSc) library provides scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization via the Toolkit for Advanced Optimization (TAO). PETSc is used in dozens of scientific fields and is an important building block for many simulation codes. During the U.S. Department of Energy's Exascale Computing Project, the PETSc team has made substantial efforts to enable efficient utilization of the massive fine-grain parallelism present within exascale compute nodes and to enable performance portability across exascale architectures. We recap some of the challenges that designers of numerical libraries face in such an endeavor, and then discuss the many developments we have made, which include the addition of new GPU backends, features supporting efficient on-device matrix assembly, better support for asynchronicity and GPU kernel concurrency, and new communication infrastructure. We evaluate the performance of these developments on some pre-exascale systems as well the early exascale systems Frontier and Aurora, using compute kernel, communication layer, solver, and mini-application benchmark studies, and then close with a few observations drawn from our experiences on the tension between portable performance and other goals of numerical libraries.
Paper Structure (24 sections, 3 equations, 13 figures, 2 tables)

This paper contains 24 sections, 3 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Two star-forest examples. The left example has three MPI ranks, while the right has two. Vertical dashed lines separate MPI ranks. Colored boxes are roots (leaves). Enclosed numbers are indices of the roots (leaves) in their index space. Dashed boxes represent holes in the spaces not belonging to the SF.
  • Figure 2: A typical data path of PetscSFReduce() with CUDA, assuming all parts except MPI work on a common device stream local to the calling process. Note the stream synchronization before MPI_Isend().
  • Figure 3: SF-pingpong test: MPI latency between two closest GPUs on the four (pre-)exascale machines, and between two CPU cores within a compute node on Perlmutter. Note the strikingly better performance on Frontier with small messages compared with other machines.
  • Figure 4: SF-unpack test: the test is similar to Figure \ref{['fig:SF-Pingpong']}, except the latency contains the execution time of the unpack kernel after receiving data. Note the performance on Frontier will small messages did not stand out anymore as in Figure \ref{['fig:SF-Pingpong']}. Also note GPUs have much better performance than CPUs in the unpack kernel with big messages thanks to their higher memory bandwidth.
  • Figure 5: Timeline of CG (top) and CGAsync (bottom) on rank 2. Each ran ten iterations. The blue csr... bars are csrMV (i.e., SpMV) kernels in cuSPARSE, and the red c... bars are cudaMemcpyAsync() copying data from device to host.
  • ...and 8 more figures