Heterogeneous computing in a strongly-connected CPU-GPU environment: fast multiple time-evolution equation-based modeling accelerated using data-driven approach

Tsuyoshi Ichimura; Kohei Fujita; Muneo Hori; Lalith Maddegedara; Jack Wells; Alan Gray; Ian Karlin; John Linford

TL;DR

The paper proposes a CPU-GPU heterogeneous computing method that solves time-evolution partial differential equation problems many times with guaranteed accuracy, short time-to-solution, and low energy-to-solution. The results indicate that directive-based programming is highly effective for analyses in heterogeneous computing environments.

Abstract

We propose a CPU-GPU heterogeneous computing method for solving time-evolution partial differential equation problems many times with guaranteed accuracy, short time-to-solution, and low energy-to-solution. On a single-GH200 node, the proposed method ran 86.4 times faster than the conventional method on the CPU alone and 8.67 times faster than the conventional method on the GPU alone. The energy-to-solution was reduced 32.2-fold (from 9944 J to 309 J) relative to the CPU-only run and 7.01-fold (from 2163 J to 309 J) relative to the GPU-only run. On the Alps supercomputer, the proposed method attained speedups of 51.6-fold and 6.98-fold over the CPU-only and GPU-only runs, respectively, and a high weak scaling efficiency of 94.3% up to 1,920 compute nodes. The implementations use directive-based parallel programming models, which keeps them portable and indicates that directives are highly effective for analyses in heterogeneous computing environments.
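As a quick arithmetic check, the fold reductions in energy-to-solution can be recomputed from the joule values quoted in the abstract. This is a minimal sketch; the helper name `fold_reduction` is hypothetical and not from the paper.

```python
def fold_reduction(baseline_joules, proposed_joules):
    """Return how many times smaller the proposed energy is than the baseline."""
    return baseline_joules / proposed_joules

# Joule values as quoted in the abstract (single-GH200 node).
cpu_only = fold_reduction(9944.0, 309.0)
gpu_only = fold_reduction(2163.0, 309.0)
print(f"{cpu_only:.2f} {gpu_only:.2f}")
```

The recomputed ratios agree with the quoted 32.2-fold and 7.01-fold figures to within about 0.02; the small difference in the last digit is presumably because the quoted joule values are rounded.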

Paper Structure (10 sections, 9 equations, 5 figures, 4 tables, 4 algorithms)

Introduction
Method
Target problem & baseline method
Proposed heterogeneous computing method
Numerical Experiment
Problem setting & results of application example
Concrete form of CRS-CG@CPU, CRS-CG@GPU, EBE-MCG@CPU-GPU, and CRS-CG@CPU-GPU
Performance measurement on a single-GH200 node
Performance measurement on Alps
Concluding Remarks

Figures (5)

Figure 1: Target ground structure and results of frequency domain decomposition. All ground structures have a flat surface but different interface shapes between the sedimentary layer and bedrock. All models have dimensions of 950$\times$950$\times$120 m with a minimum element size of 2.5 m for resolving the frequency components up to 5.0 Hz. The numbers of second-order tetrahedral nodes and elements in model a are 15,509,903 and 11,365,697, respectively (the number of unknowns in Eq. \ref{GE:DIS} is 46,529,709).
Figure 2: Proposed heterogeneous computational algorithm implemented on multiple compute nodes. The finite element model of the target domain is partitioned into as many subdomains as there are compute nodes, and each compute node executes Algorithm \ref{EBE-MCG@CPU-GPU} using two MPI processes. As the predictor does not require information exchange between partitions, inter-node communication is used only in the solver@GPU, to keep the nodal values consistent between partitions.
Figure 3: Convergence history of the solver for each initial solution estimation method for one time step. Using the data-driven predictor reduces the number of iterations required to meet the error threshold of $\epsilon=10^{-8}$, compared with the Adams-Bashforth method used in conventional methods.
Figure 4: Breakdown of elapsed time and selection of $s$ during the simulation in EBE-MCG@CPU-GPU on a single-GH200 node. Although the convergence of the problem changes during the time-history simulation, a suitable $s$ is selected such that the elapsed time of the solver and predictor becomes balanced.
Figure 5: Weak scaling of EBE-MCG@CPU-GPU on Alps. The times shown are the average elapsed time per time step over time steps 250--500 of the simulation, per problem case.
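The effect described in the Figure 3 caption, where a better initial solution estimate lowers the iteration count of the iterative solver, can be illustrated with a toy problem. The sketch below is not the paper's method: it uses a small tridiagonal matrix instead of a finite element system, a plain conjugate gradient solver, and simple linear extrapolation from the two previous time steps as a stand-in for both the Adams-Bashforth estimate and the data-driven predictor. All names (`matvec`, `cg`, `rhs`, `run`) and parameter values are assumptions made for illustration.

```python
import math

def matvec(x, c=0.1):
    """Apply A = tridiag(-1, 2 + c, -1), a small SPD stand-in matrix."""
    n = len(x)
    y = [(2.0 + c) * x[i] for i in range(n)]
    for i in range(n - 1):
        y[i] -= x[i + 1]
        y[i + 1] -= x[i]
    return y

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cg(b, x0, tol=1e-8, max_iter=1000):
    """Plain conjugate gradient; returns (solution, iteration count)."""
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(x))]
    p = list(r)
    rr = dot(r, r)
    limit = (tol ** 2) * dot(b, b)  # stop when ||r|| <= tol * ||b||
    for k in range(max_iter):
        if rr <= limit:
            return x, k
        ap = matvec(p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter

def rhs(t, n=40):
    """Hypothetical time-varying right-hand side that changes smoothly in t."""
    return [math.sin(t + 0.37 * i * i) for i in range(n)]

def run(steps=20, dt=0.01, extrapolate=True, n=40):
    """Solve A u = f(t) at each step; return the CG iteration count per step."""
    history, counts = [], []
    for step in range(steps):
        b = rhs(step * dt, n)
        if extrapolate and len(history) >= 2:
            # Initial guess: linear extrapolation of the two previous solutions.
            x0 = [2.0 * u1 - u2 for u1, u2 in zip(history[-1], history[-2])]
        else:
            x0 = [0.0] * n  # no estimate available: start from zero
        x, k = cg(b, x0)
        history.append(x)
        counts.append(k)
    return counts

cold = run(extrapolate=False)
warm = run(extrapolate=True)
print("total iterations, zero initial guess:", sum(cold))
print("total iterations, extrapolated guess:", sum(warm))
```

Because the right-hand side changes smoothly in time, the extrapolated guess starts each solve with a much smaller residual, so the solver needs fewer iterations to reach the same relative tolerance of $10^{-8}$. The first two steps use a zero guess in both runs, since no history exists yet.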
