Communication-reduced Conjugate Gradient Variants for GPU-accelerated Clusters

Massimo Bernaschi, Mauro G. Carrozzo, Alessandro Celestini, Giacomo Piperno, Pasqua D'Ambra

TL;DR

This work addresses the bottleneck of communication in solving large sparse SPD linear systems on GPU-accelerated clusters by implementing a preconditioned s-step CG method that aggregates computations and overlaps communication with computation. The authors design building blocks that exploit GPU throughput, integrate a scalable AMG preconditioner, and provide an open-source library BootCMatchGX for distributed multi-GPU environments. Numerical experiments on Poisson-discretized problems up to $n=10^9$ unknowns demonstrate favorable strong and weak scalability, with significant gains when using larger $s$ and AMG preconditioning. The approach offers practical impact for PDE-based simulations and data-driven scientific computing where efficient, scalable sparse solvers are critical.

Abstract

Linear solvers are key components in any software platform for scientific and engineering computing. The solution of large and sparse linear systems lies at the core of physics-driven numerical simulations relying on partial differential equations (PDEs) and often represents a significant bottleneck in data-driven procedures, such as scientific machine learning. In this paper, we present an efficient implementation of the preconditioned s-step Conjugate Gradient (CG) method, originally proposed by Chronopoulos and Gear in 1989, for large clusters of NVIDIA GPU-accelerated computing nodes. The method, often referred to as communication-reduced or communication-avoiding CG, reduces global synchronizations and data communication steps compared to the standard approach, enhancing strong and weak scalability on parallel computers. Our main contribution is the design of a parallel solver that fully exploits the aggregation of low-granularity operations inherent to the s-step CG method to leverage the high throughput of GPU accelerators. Additionally, it applies overlap between data communication and computation in the multi-GPU sparse matrix-vector product. Experiments on classic benchmark datasets, derived from the discretization of the Poisson PDE, demonstrate the potential of the method.
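
The aggregation idea mentioned in the abstract can be illustrated with a small sketch. It is not the authors' solver and does not use the BootCMatchGX API: it is single-process plain Python, with no GPU and no MPI. It builds a block of s Krylov vectors for a 1D Poisson matrix and compares computing their inner products one reduction at a time against packing them all into a single reduction. The names `poisson1d_matvec`, `krylov_basis`, `ReductionCounter`, `separate_dots`, and `fused_gram` are hypothetical, and the plain Euclidean inner products and monomial basis are simplifying assumptions.

```python
# Illustrative sketch only: plain Python, single process, no GPU or MPI.
# All names are hypothetical and not taken from BootCMatchGX.

def poisson1d_matvec(x):
    """Return y = A x for the 1D Poisson matrix tridiag(-1, 2, -1)."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def krylov_basis(r, s):
    """Monomial basis [r, A r, ..., A^(s-1) r], built with s-1 matvecs."""
    basis = [list(r)]
    for _ in range(s - 1):
        basis.append(poisson1d_matvec(basis[-1]))
    return basis


class ReductionCounter:
    """Stand-in for a global reduction (for example an allreduce)."""

    def __init__(self):
        self.calls = 0

    def reduce(self, local_values):
        self.calls += 1       # one synchronization point
        return local_values   # single process: nothing to sum across ranks


def separate_dots(basis, comm):
    """One reduction per inner product."""
    s = len(basis)
    gram = [[0.0] * s for _ in range(s)]
    for i in range(s):
        for j in range(i, s):
            local = sum(a * b for a, b in zip(basis[i], basis[j]))
            gram[i][j] = gram[j][i] = comm.reduce([local])[0]
    return gram


def fused_gram(basis, comm):
    """All local inner products packed into a single reduction."""
    s = len(basis)
    local = [sum(a * b for a, b in zip(basis[i], basis[j]))
             for i in range(s) for j in range(s)]
    flat = comm.reduce(local)
    return [flat[i * s:(i + 1) * s] for i in range(s)]


r = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
basis = krylov_basis(r, 3)
c_sep, c_fused = ReductionCounter(), ReductionCounter()
g_sep = separate_dots(basis, c_sep)
g_fused = fused_gram(basis, c_fused)
print(c_sep.calls, c_fused.calls)  # prints "6 1"
```

Both routines produce the same 3 x 3 matrix of inner products, but the fused version reaches it with one synchronization point instead of six. On a real cluster each synchronization carries network latency, and a fused local computation maps to one larger, higher-throughput kernel, which is the kind of trade the abstract describes.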

Paper Structure

This paper contains 11 sections, 7 equations, 6 figures, and 6 algorithms.

Figures (6)

  • Figure 1: Strong scalability: breakdown of solve time when no preconditioner is applied.
  • Figure 2: Strong scalability: solve time per iteration when no preconditioner is applied.
  • Figure 3: Weak scalability: breakdown of solve time when no preconditioner is applied.
  • Figure 4: Scaled speedup of the solve time when no preconditioner is applied.
  • Figure 5: Weak scalability: breakdown of solve time when AMG preconditioner is applied.
  • ...and 1 more figure