Communication-reduced Conjugate Gradient Variants for GPU-accelerated Clusters
Massimo Bernaschi, Mauro G. Carrozzo, Alessandro Celestini, Giacomo Piperno, Pasqua D'Ambra
TL;DR
This work addresses the communication bottleneck in solving large sparse symmetric positive definite (SPD) linear systems on GPU-accelerated clusters by implementing a preconditioned s-step CG method that aggregates computations and overlaps communication with computation. The authors design building blocks that exploit GPU throughput, integrate a scalable algebraic multigrid (AMG) preconditioner, and provide the open-source library BootCMatchGX for distributed multi-GPU environments. Numerical experiments on problems arising from the discretization of the Poisson equation, with up to $n=10^9$ unknowns, demonstrate favorable strong and weak scalability, with significant gains when using larger $s$ and AMG preconditioning. The approach has practical impact for PDE-based simulations and data-driven scientific computing, where efficient, scalable sparse solvers are critical.
Abstract
Linear solvers are key components in any software platform for scientific and engineering computing. The solution of large and sparse linear systems lies at the core of physics-driven numerical simulations relying on partial differential equations (PDEs) and often represents a significant bottleneck in data-driven procedures, such as scientific machine learning. In this paper, we present an efficient implementation of the preconditioned s-step Conjugate Gradient (CG) method, originally proposed by Chronopoulos and Gear in 1989, for large clusters of Nvidia GPU-accelerated computing nodes. The method, often referred to as communication-reduced or communication-avoiding CG, reduces global synchronizations and data communication steps compared to the standard approach, enhancing strong and weak scalability on parallel computers. Our main contribution is the design of a parallel solver that fully exploits the aggregation of low-granularity operations inherent to the s-step CG method to leverage the high throughput of GPU accelerators. Additionally, it overlaps data communication with computation in the multi-GPU sparse matrix-vector product. Experiments on classic benchmark datasets, derived from the discretization of the Poisson PDE, demonstrate the potential of the method.
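To make the idea of aggregating low-granularity operations concrete, the following is a minimal sketch, not the BootCMatchGX implementation. It simulates vectors distributed across a few "ranks" with plain Python lists and counts how many global reductions are needed to form the Gram matrix of $s$ vectors, once with one reduction per inner product and once with all $s^2$ local partial sums packed into a single reduction. All names (`split`, `Reducer`, `gram_one_by_one`, `gram_aggregated`) are hypothetical and chosen only for illustration; a real solver would use MPI or a GPU collective library in place of the toy `allreduce_sum`.

```python
# Toy model of aggregated inner products. Assumption: every name below is
# hypothetical and unrelated to the BootCMatchGX API.

def split(vec, nranks):
    """Partition a vector into contiguous chunks, one per simulated rank."""
    n = len(vec)
    bounds = [n * r // nranks for r in range(nranks + 1)]
    return [vec[bounds[r]:bounds[r + 1]] for r in range(nranks)]


class Reducer:
    """Stand-in for a global all-reduce; counts synchronization points."""

    def __init__(self):
        self.calls = 0

    def allreduce_sum(self, per_rank_payloads):
        # per_rank_payloads holds one equally sized list of floats per rank.
        self.calls += 1
        return [sum(vals) for vals in zip(*per_rank_payloads)]


def local_dot(x, y):
    """Inner product of the locally owned pieces of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def gram_one_by_one(parts, reducer):
    """Gram matrix with one global reduction per inner product."""
    s, nranks = len(parts), len(parts[0])
    gram = [[0.0] * s for _ in range(s)]
    for i in range(s):
        for j in range(s):
            payloads = [[local_dot(parts[i][r], parts[j][r])]
                        for r in range(nranks)]
            gram[i][j] = reducer.allreduce_sum(payloads)[0]
    return gram


def gram_aggregated(parts, reducer):
    """Gram matrix with all s*s partial sums packed into one reduction."""
    s, nranks = len(parts), len(parts[0])
    payloads = [[local_dot(parts[i][r], parts[j][r])
                 for i in range(s) for j in range(s)]
                for r in range(nranks)]
    flat = reducer.allreduce_sum(payloads)
    return [flat[i * s:(i + 1) * s] for i in range(s)]


# Small example: s = 3 vectors of length 8 spread over 4 simulated ranks.
s, n, nranks = 3, 8, 4
basis = [[float((i + 1) * (k + 1)) for k in range(n)] for i in range(s)]
parts = [split(v, nranks) for v in basis]

naive, packed = Reducer(), Reducer()
G_naive = gram_one_by_one(parts, naive)
G_packed = gram_aggregated(parts, packed)
print(naive.calls, packed.calls)
```

Both routines produce the same matrix; the difference is the number of synchronization points, which is what matters at scale. Packing also turns many tiny reductions into one larger operation, which is the kind of granularity that suits high-throughput accelerators.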
