On the energy efficiency of sparse matrix computations on multi-GPU clusters

Massimo Bernaschi, Alessandro Celestini, Pasqua D'Ambra, Giorgio Richelli

TL;DR

This work analyzes the energy efficiency of sparse matrix computations on multi-GPU clusters by profiling BootCMatchGX, a GPU-accelerated library for Krylov solvers and AMG preconditioners. It introduces a detailed methodology for measuring power consumption and differentiating dynamic from static energy, then compares BootCMatchGX against Ginkgo and NVIDIA AmgX across SpMV, CG, and PCG on strong and weak scalability tests. The results show BootCMatchGX delivering lower execution times and reduced dynamic energy in most cases, with PCG benefiting most from its preconditioner design. The findings underscore the importance of kernel and communication optimizations for sustainable HPC and propose future directions in mixed-precision AMG and cross-domain validation.

Abstract

We investigate the energy efficiency of a library designed for parallel computations with sparse matrices. The library leverages high-performance, energy-efficient Graphics Processing Unit (GPU) accelerators to enable large-scale scientific applications. Our primary development objective was to maximize parallel performance and scalability in solving sparse linear systems whose dimensions far exceed the memory capacity of a single node. To this end, we devised methods that expose a high degree of parallelism while optimizing algorithmic implementations for efficient multi-GPU usage. Previous work has already demonstrated the library's performance efficiency on large-scale systems comprising thousands of NVIDIA GPUs, achieving improvements over state-of-the-art solutions. In this paper, we extend those results by providing energy profiles that address the growing sustainability requirements of modern HPC platforms. We present our methodology and tools for accurate runtime energy measurements of the library's core components and discuss the findings. Our results confirm that optimizing GPU computations and minimizing data movement across memory and computing nodes reduce both time-to-solution and energy consumption. Moreover, we show that the library delivers substantial advantages over comparable software frameworks on standard benchmarks.

Paper Structure

This paper contains 11 sections, 4 equations, 16 figures.

Figures (16)

  • Figure 1: Execution workflow for CPU and GPU power monitoring.
  • Figure 2: Power–time profile of the SpMV kernel measured within the BootCMatchGX library on a single node equipped with four GPUs. The green and purple markers denote the points at which the GPUs leave and return to the idle state, respectively. These reference points are used to estimate the static power consumption of the GPUs.
  • Figure 3: SpMV execution times under weak and strong scalability scenarios.
  • Figure 4: Dynamic energy consumption breakdown of the SpMV computation on GPU and CPU under weak and strong scalability scenarios.
  • Figure 5: GPU power peak of the SpMV computation under weak and strong scalability scenarios.
  • ...and 11 more figures
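
The Figure 2 caption describes using the points where the GPUs leave and return to the idle state to estimate static power, which in turn separates dynamic from static energy. A minimal sketch of one way to do that split follows. It is an illustration under stated assumptions, not the authors' tooling: the function names, the sample layout (parallel lists of timestamps in seconds and power in watts), and the choice to average the idle samples outside the active window are all hypothetical.

```python
from statistics import mean


def trapezoid_energy(times, powers):
    """Integrate power (W) over time (s) with the trapezoidal rule; returns joules."""
    return sum(
        0.5 * (powers[i] + powers[i + 1]) * (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    )


def split_energy(times, powers, t_leave_idle, t_return_idle):
    """Split the energy of the active window into static and dynamic parts.

    Assumption: static power is the mean of the samples taken strictly
    before the device leaves idle or strictly after it returns to idle.
    """
    idle = [p for t, p in zip(times, powers)
            if t < t_leave_idle or t > t_return_idle]
    if not idle:
        raise ValueError("no idle samples available to estimate static power")
    p_static = mean(idle)

    # Keep only the samples inside the active window (endpoints included).
    active = [(t, p) for t, p in zip(times, powers)
              if t_leave_idle <= t <= t_return_idle]
    if len(active) < 2:
        raise ValueError("need at least two samples in the active window")
    a_times = [t for t, _ in active]
    a_powers = [p for _, p in active]

    total_j = trapezoid_energy(a_times, a_powers)
    static_j = p_static * (a_times[-1] - a_times[0])
    return {
        "static_power_w": p_static,
        "total_j": total_j,
        "static_j": static_j,
        "dynamic_j": total_j - static_j,
    }


if __name__ == "__main__":
    # Synthetic trace: idle at 50 W, a burst at 250 W between t=2 s and t=5 s.
    t = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    p = [50.0, 50.0, 50.0, 250.0, 250.0, 50.0, 50.0]
    print(split_energy(t, p, t_leave_idle=2.0, t_return_idle=5.0))
```

In a real measurement the power samples would come from a GPU or CPU power interface rather than a synthetic list, and the sampling interval would bound the accuracy of the integration.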