Communication-reduced Conjugate Gradient Variants for GPU-accelerated Clusters

Massimo Bernaschi; Mauro G. Carrozzo; Alessandro Celestini; Giacomo Piperno; Pasqua D'Ambra

TL;DR

This work addresses the communication bottleneck in solving large sparse symmetric positive definite (SPD) linear systems on GPU-accelerated clusters. It implements a preconditioned s-step CG method that aggregates computations and overlaps communication with computation. The authors design building blocks that exploit GPU throughput, integrate a scalable algebraic multigrid (AMG) preconditioner, and provide the open-source library BootCMatchGX for distributed multi-GPU environments. Numerical experiments on problems arising from the discretization of the Poisson PDE, with up to $n=10^9$ unknowns, demonstrate favorable strong and weak scalability, with significant gains when using larger $s$ and AMG preconditioning. The approach is of practical use for PDE-based simulations and data-driven scientific computing, where efficient, scalable sparse solvers are critical.

Abstract

Linear solvers are key components in any software platform for scientific and engineering computing. The solution of large and sparse linear systems lies at the core of physics-driven numerical simulations relying on partial differential equations (PDEs) and often represents a significant bottleneck in data-driven procedures, such as scientific machine learning. In this paper, we present an efficient implementation of the preconditioned s-step Conjugate Gradient (CG) method, originally proposed by Chronopoulos and Gear in 1989, for large clusters of NVIDIA GPU-accelerated computing nodes. The method, often referred to as communication-reduced or communication-avoiding CG, reduces global synchronizations and data communication steps compared to the standard approach, enhancing strong and weak scalability on parallel computers. Our main contribution is the design of a parallel solver that fully exploits the aggregation of low-granularity operations inherent to the s-step CG method to leverage the high throughput of GPU accelerators. Additionally, it overlaps data communication with computation in the multi-GPU sparse matrix-vector product. Experiments on classic benchmark datasets, derived from the discretization of the Poisson PDE, demonstrate the potential of the method.
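The reduction in global synchronizations mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation and does not use the BootCMatchGX API. It only shows, under simplifying assumptions, why aggregating the inner products of an s-step block lowers the number of global reductions. The names `FakeComm`, `krylov_block`, `gram_separate`, and `gram_fused` are hypothetical, the "ranks" are simulated by slicing Python lists, and the matrix is the 1D Poisson stencil tridiag(-1, 2, -1).

```python
def matvec_1d_laplacian(x):
    """y = A x for the 1D Poisson matrix tridiag(-1, 2, -1)."""
    n = len(x)
    return [
        2.0 * x[i]
        - (x[i - 1] if i > 0 else 0.0)
        - (x[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


def krylov_block(r, s):
    """Return the monomial basis [r, A r, ..., A^(s-1) r]."""
    block = [r]
    for _ in range(s - 1):
        block.append(matvec_1d_laplacian(block[-1]))
    return block


class FakeComm:
    """Hypothetical stand-in for a communicator; it only counts reductions."""

    def __init__(self):
        self.reductions = 0

    def allreduce(self, partials):
        # partials holds one list of numbers per simulated rank.
        self.reductions += 1
        return [sum(values) for values in zip(*partials)]


def split(v, nranks):
    """Distribute a vector over nranks contiguous chunks."""
    chunk = (len(v) + nranks - 1) // nranks
    return [v[k * chunk:(k + 1) * chunk] for k in range(nranks)]


def local_dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram_separate(block, nranks, comm):
    """One global reduction per inner product, as in a naive formulation."""
    s = len(block)
    parts = [split(v, nranks) for v in block]
    gram = [[0.0] * s for _ in range(s)]
    for i in range(s):
        for j in range(i, s):
            partials = [[local_dot(parts[i][k], parts[j][k])] for k in range(nranks)]
            gram[i][j] = gram[j][i] = comm.allreduce(partials)[0]
    return gram


def gram_fused(block, nranks, comm):
    """All inner products of the block aggregated into one global reduction."""
    s = len(block)
    parts = [split(v, nranks) for v in block]
    partials = [
        [local_dot(parts[i][k], parts[j][k]) for i in range(s) for j in range(s)]
        for k in range(nranks)
    ]
    flat = comm.allreduce(partials)
    return [flat[i * s:(i + 1) * s] for i in range(s)]


if __name__ == "__main__":
    block = krylov_block([1.0] * 16, s=3)
    comm_a, comm_b = FakeComm(), FakeComm()
    g_a = gram_separate(block, nranks=4, comm=comm_a)
    g_b = gram_fused(block, nranks=4, comm=comm_b)
    print("same result:", g_a == g_b)
    print("reductions:", comm_a.reductions, "vs", comm_b.reductions)
```

Both routines return the same small matrix of inner products, but the fused variant issues a single reduction per block instead of one per inner product. The fused local computation is also a dense block operation, which is the kind of aggregated work that suits a high-throughput accelerator.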

Paper Structure (11 sections, 7 equations, 6 figures, 6 algorithms)

Introduction
Background
Preconditioned $s$-step CG
Multi-GPU Design and Implementation Issues
Parallel Preconditioner
Related Work
Numerical Results
Strong Scalability
Weak Scalability
Concluding Remarks
Acknowledgements

Figures (6)

Figure 1: Strong scalability: breakdown of solve time when no preconditioner is applied.
Figure 2: Strong scalability: solve time per iteration when no preconditioner is applied.
Figure 3: Weak Scalability: breakdown of solve time when no preconditioner is applied.
Figure 4: Scaled Speedup of the solve time when no preconditioner is applied.
Figure 5: Weak Scalability: breakdown of solve time when AMG preconditioner is applied.
...and 1 more figure
