
PSCToolkit: solving sparse linear systems with a large number of GPUs

Pasqua D'Ambra, Fabio Durastante, Salvatore Filippone

TL;DR

PSCToolkit delivers a GPU-accelerated toolkit for solving large sparse linear systems on HPC platforms, targeting symmetric positive-definite problems and scalability to thousands of GPUs. It combines three components—PSBLAS, AMG4PSBLAS, and PSBLAS extensions—with GPU-aware memory management via mold variables and the Hacked ELLPACK format to enable efficient Krylov solvers preconditioned by AMG. Through extensive experiments on EuroHPC Leonardo, the authors demonstrate weak and strong scaling, competitive iterations and solve times compared to AMGX, and emphasize low operator complexity to sustain performance as GPU counts grow. The work points to future enhancements in OpenMP/OpenACC integration, polynomial smoothers, and broader hardware support, aiming to improve portability and maintainability while preserving high performance.

Abstract

In this chapter, we describe the Parallel Sparse Computation Toolkit (PSCToolkit), a suite of libraries for solving large-scale linear algebra problems in an HPC environment. In particular, we focus on the tools provided for the solution of symmetric and positive-definite linear systems using up to 8192 GPUs on the EuroHPC-JU Leonardo supercomputer. PSCToolkit is an ongoing mathematical software project aimed at exploiting the extreme computational speed of current supercomputers for relevant problems in Computational and Data Science. The toolkit is designed for node-level efficiency, flexibility and usability, supporting integration with both Fortran and C/C++, enabling researchers and developers from diverse computational backgrounds to leverage its powerful capabilities.

Paper Structure (10 sections, 6 equations, 11 figures)

Figures (11)

  • Figure 1: Assignment of indices in global and local numbering for two processes. The left panel shows the adjacency graph of the matrix with its global numbering; the two lateral graphs are the subgraphs with the locally numbered indices of the two processes.
  • Figure 2: Example of using the mold option to keep data resident on the GPU. Specifically, we request the GPU version of the Hacked ELLPACK format [Filippone:2017], together with the corresponding dense GPU vector format for the right-hand side and for the communicator index space.
  • Figure 3: Example of the aggregates obtained with the two different schemes. On the left, the aggregation of Vaněk, Mandel and Brezina; on the right, the aggregation obtained with approximate maximum-weight matching.
  • Figure 4: Preconditioner instantiation and selection for a multigrid preconditioner using a V-cycle with smoothed aggregation based on weighted graph matching.
  • Figure 5: Setup of the $\ell_1$-Jacobi method both as smoother (4 iterations) and as coarse solver (30 iterations), followed by the calls that build the hierarchy and the smoother.
  • ...and 6 more figures