Electron-phonon physics at the exascale: A hybrid MPI-GPU-OpenMP framework for scalable Wannier interpolation

Tae Yun Kim; Zhe Liu; Sabyasachi Tiwari; Elena R. Margine; Feliciano Giustino

TL;DR

A GPU porting strategy is designed that integrates naturally into the current EPW implementation and is seamlessly portable to NVIDIA, AMD, and Intel GPUs. It achieves up to 29-fold speedup on leadership-class supercomputers equipped with NVIDIA and Intel accelerators.

Abstract

We demonstrate a highly efficient GPU implementation of the Wannier interpolation of electron-phonon matrix elements in the EPW code. Building on a systematic analysis of the computational complexity of the algorithm for electron-phonon interpolation, we designed a GPU porting strategy that integrates naturally into the current EPW implementation and is seamlessly portable to NVIDIA, AMD, and Intel GPUs. We demonstrate this development via extensive benchmarks on conventional semiconductors such as silicon and monolayer MoS$_2$, as well as a large-scale application to topological stanene nanoribbons of width up to 20 nm, which was intractable with previous implementations. Compared to the single-level MPI parallelization scheme of EPW v5.9, the resulting hybrid MPI-GPU-OpenMP scheme achieves up to 29-fold speedup on leadership-class supercomputers equipped with NVIDIA and Intel accelerators, namely Vista at the Texas Advanced Computing Center, Perlmutter at the National Energy Research Scientific Computing Center, and Aurora at the Argonne Leadership Computing Facility. This framework also achieves nearly ideal scalability up to thousands of GPU nodes on the Aurora supercomputer. With this development, EPW is ready to support electron-phonon physics calculations on exascale platforms.

Paper Structure (19 equations, 13 figures, 3 tables)

Figures (13)

Figure 1: Simplified flowcharts for electron-phonon matrix interpolation. a Single-loop scheme, where the interpolation of the fine-grid matrix $g_{mn\nu}(\mathbf{k},\mathbf{q})$ from the coarse-grid matrix $g_{m'n'\kappa\alpha}(\mathbf{R}_{\rm e},\mathbf{R}_{\rm p})$ is performed by repeating a single-step procedure [Eq. \ref{eq:g_fine}] over $\mathbf{k}$ and $\mathbf{q}$ pairs; b Nested-loop scheme, where the interpolation is divided into two steps [Eqs. \ref{eq:g_fine_1} and \ref{eq:g_fine_2}]. These substeps are carried out at different levels of the nested loop structure: the outer loop runs over $\mathbf{q}$ and the inner loop over $\mathbf{k}$. This approach requires allocating a buffer array to store the intermediate result $g_{m'n'\kappa\alpha}(\mathbf{R}_{\rm e},\mathbf{q})$. The size of this buffer is smaller than that of the coarse-grid matrix by a factor of $N_{\mathbf{q}}$, the number of $\mathbf{q}$ points.
Figure 2: Flowcharts for electron-phonon matrix interpolation implemented in EPW. a Two-level MPI scheme for the interpolation of the electron-phonon matrix $g_{mn\nu}(\mathbf{k},\mathbf{q})$, implemented in EPW 6.0. This approach builds on the nested-loop algorithm (Fig. \ref{fig:intp_loop} b), where the interpolation is performed in two substeps (green and blue boxes). The workload is distributed across the image and pool parallelization levels: the $\mathbf{R}_{\rm p}$ and $\mathbf{k}$ indices are divided among pools, and collective MPI reductions (e.g., summation) over pools are required to assemble the $\mathbf{q}$ slice of the intermediate $g_{m'n'\kappa\alpha}(\mathbf{R}_{\rm e},\mathbf{q})$ and the final $g_{mn\nu}(\mathbf{k},\mathbf{q})$ arrays. The $\mathbf{q}$ points are distributed across images, and a final MPI summation over images gives the complete $g_{mn\nu}(\mathbf{k},\mathbf{q})$ for all $\mathbf{q}$ points. b Hybrid MPI-GPU-OpenMP scheme, introduced in EPW 6.1. GPU acceleration and OpenMP multithreading are incorporated into the outer $\mathbf{q}$ loop of the two-level MPI framework. In this design, the Fourier transform that converts $g_{m'n'\kappa\alpha}(\mathbf{R}_{\rm e},\mathbf{R}_{\rm p})$ to $g_{m'n'\kappa\alpha}(\mathbf{R}_{\rm e},\mathbf{q})$ (green box) is offloaded to GPUs for speed, while OpenMP threads allow each MPI rank to fully exploit the available CPU cores.
Figure 3: Overview of the workload distribution in the hybrid MPI-GPU-OpenMP scheme. Each computing node in this example system has two GPUs and eight hardware threads (CPU cores). The workload is distributed with one image per node, two pools per image, one GPU per pool, and four OpenMP threads per pool.
Figure 4: Crystal structures of the systems used in benchmark calculations. a Bulk silicon (conventional cell); b Two-dimensional MoS$_2$ ($3\times3\times1$ supercell). The figures were generated with the VESTA program [Momma2008].
Figure 5: Comparison of single-node performance of electron-phonon matrix interpolation. a Relative speedup and b wall time in minutes, measured on three supercomputers: Vista at the Texas Advanced Computing Center (TACC), Aurora at the Argonne Leadership Computing Facility (ALCF), and Perlmutter at the National Energy Research Scientific Computing Center (NERSC). Benchmarks are based on ab initio Boltzmann transport calculations for bulk silicon. For each system, the baseline calculation was performed on a single CPU node using the single-level MPI scheme (1L MPI, EPW 5.9). The performance improvements of the two-level MPI (2L MPI, EPW 6.0) and the hybrid MPI-GPU-OpenMP (Hybrid, EPW 6.1) schemes are reported relative to this baseline.
...and 8 more figures
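
The single-loop and nested-loop schemes described in the caption of Figure 1 can be compared with a minimal NumPy sketch. This is not the EPW implementation: the equations themselves are not reproduced on this page, so the phase convention $e^{2\pi i\,\mathbf{k}\cdot\mathbf{R}}$, the array layout, the toy sizes, and the function names (`phase`, `single_loop`, `nested_loop`) are all assumptions made for illustration. The rotation from the Wannier and Cartesian basis to the Bloch and phonon-eigenmode basis is omitted, so only the double Fourier sum is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; production calculations are far larger).
n_re, n_rp = 5, 4        # coarse-grid lattice vectors R_e and R_p
n_k, n_q = 6, 3          # fine-grid k and q points
n_w, n_modes = 2, 3      # Wannier functions and displacement components

# Integer lattice vectors and fractional fine-grid points (crystal coordinates).
R_e = rng.integers(-2, 3, size=(n_re, 3))
R_p = rng.integers(-2, 3, size=(n_rp, 3))
k_pts = rng.random((n_k, 3))
q_pts = rng.random((n_q, 3))

# Random stand-in for the coarse-grid matrix g(R_e, R_p).
shape = (n_re, n_rp, n_w, n_w, n_modes)
g_coarse = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)


def phase(point, vectors):
    """Phase factors exp(2*pi*i * point . R) for every lattice vector R (assumed convention)."""
    return np.exp(2j * np.pi * vectors @ point)


def single_loop(g, k_pts, q_pts, R_e, R_p):
    """One-step interpolation: the full (R_e, R_p) double sum for every (k, q) pair."""
    out = np.empty((len(k_pts), len(q_pts)) + g.shape[2:], dtype=complex)
    for iq, q in enumerate(q_pts):
        for ik, k in enumerate(k_pts):
            ph = np.outer(phase(k, R_e), phase(q, R_p))      # shape (n_re, n_rp)
            out[ik, iq] = np.einsum("ep,epmna->mna", ph, g)
    return out


def nested_loop(g, k_pts, q_pts, R_e, R_p):
    """Two-step interpolation: R_p -> q in the outer loop, R_e -> k in the inner loop."""
    out = np.empty((len(k_pts), len(q_pts)) + g.shape[2:], dtype=complex)
    for iq, q in enumerate(q_pts):
        # Step 1 (outer loop): fill the buffer g(R_e, q) once per q point.
        buf = np.einsum("p,epmna->emna", phase(q, R_p), g)
        for ik, k in enumerate(k_pts):
            # Step 2 (inner loop): reuse the buffer for every k point.
            out[ik, iq] = np.einsum("e,emna->mna", phase(k, R_e), buf)
    return out


g_single = single_loop(g_coarse, k_pts, q_pts, R_e, R_p)
g_nested = nested_loop(g_coarse, k_pts, q_pts, R_e, R_p)
print(np.allclose(g_single, g_nested))
```

In this sketch the two functions return the same array, but the nested version performs the sum over $\mathbf{R}_{\rm p}$ once per $\mathbf{q}$ point instead of once per $(\mathbf{k},\mathbf{q})$ pair, at the cost of holding the intermediate buffer in memory. The outer-loop step is the one that Figure 2 describes as being offloaded to GPUs.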