Efficient GPU Parallelization of Electronic Transport and Nonequilibrium Dynamics from Electron-Phonon Interactions in the Perturbo Code

Shiyu Peng, Donnie Pinkston, Jia Yao, Sergei Kliavinek, Ivan Maliyov, Marco Bernardi

TL;DR

This work addresses the computational bottleneck of first-principles Boltzmann transport calculations with electron-phonon interactions by delivering a GPU-accelerated Perturbo implementation that uses a novel, fixed-size data structure to organize scattering channels and an accumulation strategy that minimizes atomics and host-device transfers. The optimized OpenACC-based GPU code achieves roughly 40× speed-ups over the CPU baseline and exhibits near-linear strong scaling up to tens of GPU nodes, enabling efficient exploration of transport and ultrafast dynamics in complex materials and preparing Perturbo for exascale platforms. The approach is validated on GaAs, graphene, and Si systems (with and without SOC), with substantial reductions in the active scattering-channel space and detailed performance and memory analyses. The method is broadly applicable to other scattering mechanisms and paves the way for large-scale, high-resolution e-ph physics studies on next-generation HPC systems.

Abstract

The Boltzmann transport equation (BTE) with electron-phonon (e-ph) interactions computed from first principles is widely used to study electronic transport and nonequilibrium dynamics in materials. Calculating the e-ph collision integral is the most important step in the BTE, but it remains computationally costly, even with current MPI+OpenMP parallelization. This challenge makes it difficult to study materials with large unit cells and to achieve high resolution in momentum space. Here, we show acceleration of BTE calculations of electronic transport and ultrafast dynamics using graphics processing units (GPUs). We implement a novel data structure and algorithm, optimized for GPU hardware and developed using OpenACC, to process scattering channels and efficiently compute the collision integral. This approach significantly reduces the overhead for data referencing, movement, and synchronization. Relative to the efficient CPU implementation in the open-source package Perturbo (v2.2.0), used as a baseline, this approach achieves a speed-up of 40 times for both transport and nonequilibrium dynamics on GPU hardware, and achieves nearly linear scaling up to 100 GPUs. The novel data structure can be generalized to other electron interactions and scattering processes. We released this GPU implementation in the latest public version (v3.0.0) of Perturbo. The new MPI+OpenMP+GPU parallelization enables sweeping studies of e-ph physics and electron dynamics in conventional and quantum materials, and prepares Perturbo for exascale supercomputing platforms.

Paper Structure

This paper contains 13 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Schematic of the scatter_base data structure. The information for each $(\boldsymbol{k}, \boldsymbol{q})$ pair is stored as a separate entry in scatter_base, as shown using different colors. The variables ik, ikq, nchl are indices of the $\boldsymbol{k}$ and $\boldsymbol{k+q}$ points and the number of active scattering channels, respectively. The variables eph_g2 and bnds_idx are arrays holding, respectively, the squared norm of the $e$-ph matrix elements and the joint indices of bands and phonon modes for each scattering channel. The relation of these variables to the collision integral $\mathcal{I}(n,\boldsymbol{k})$, whose components can be updated by multiple processes and threads simultaneously, is shown using red arrows.
  • Figure 2: Data structure optimized for GPUs. (a) Setup of the scattering channels and target layer. Relevant quantities for the $(\boldsymbol{k}, \boldsymbol{q})$ pairs are stored into multiple arrays: scatter, which stores the indexes of the $\boldsymbol{k}$ and $\boldsymbol{k+q}$ points, and scatter_channels, which stores information for all scattering channels, such as the square of the $e$-ph matrix elements (eph_g2), the joint indices of bands and phonon modes (bnds_idx), and the index of the $(\boldsymbol{k}, \boldsymbol{q})$ pair (kq_index). Scattering channels shown with the same color are associated with the same $(\boldsymbol{k}, \boldsymbol{q})$ pair. In addition, stargets_sources indexes the elements of the collision integral $\mathcal{I}(n,\boldsymbol{k})$ and the position of the scattering channels (sc_idx) in scatter_channels. The positive (negative) sign of sc_idx reflects how that entry contributes to the collision integral. The rows of stargets_sources are arranged in order, with rows sharing the same $(n,\boldsymbol{k})$ grouped together, as shown with curly braces. Each such group is called a target, and for each group, the position in stargets_sources (src), the length (len), and the combined $(n,\boldsymbol{k})$ index (nk_index) are stored in targets. Together, stargets_sources and targets constitute the target layer. (b) Calculation of the contribution to the collision integral from each scattering channel, defined in Eq. \ref{eq:integraldyn}, which is computed and stored in sc_col. (c) Update of collision integrals $\mathcal{I}(n,\boldsymbol{k})$. Each element of $\mathcal{I}$ is updated by one target and one thread of execution. Using the target layer described in (a), each target is able to find the contribution of all the associated scattering channels in sc_col, as shown with red arrows.
  • Figure 3: Simulation setup for four systems. (a) Electrons in GaAs, (b) electrons in graphene, (c) hole carriers in silicon, and (d) holes in silicon with SOC. Band structures are shown together with the selected energy windows (shaded regions) and the initial populations for the nonequilibrium dynamics simulations (red dots). Energies are shifted so that the Fermi energy is at 0 eV.
  • Figure 4: Performance of the optimized GPU implementation. (a-d) Performance and (e-h) memory usage for the four systems studied here. The left panels (a-d) show the wall time (in seconds, on a logarithmic scale) for ultrafast dynamics (blue) and transport (red) calculations. The speedup values, obtained as the ratio of Baseline-CPU to optimized-GPU code wall times, are given above each bar in the optimized-GPU results. The right panels (e-h) give the memory usage (in GB) on CPU (solid colors) and GPU (striped bars) for the same systems. Memory usage values annotated in the plot are referenced to the baseline CPU results.
  • Figure 5: Strong-scaling performance. Speedup versus number of GPU nodes for (a) GaAs, (b) graphene, (c) Si, and (d) Si with SOC. Results for the optimized-GPU code are shown using solid lines with symbols for ultrafast dynamics (purple) and transport (red). The dashed line shows the ideal linear scaling. Common scenarios for most users ($\leq 20$ GPU nodes) are indicated with shaded regions.
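
The target layer described in the Figure 2 caption can be illustrated with a small, self-contained Python sketch. This is not Perturbo's Fortran/OpenACC code. The array names sc_col, src, len (written `length` here), and nk_index follow the caption, while the helper functions `build_targets` and `gather_collision_integral`, the toy values, the 1-based signed sc_idx, and the convention that a positive sign adds and a negative sign subtracts are all assumptions made for illustration. The point is that each target owns exactly one element of the collision integral, so the update needs no atomic operations.

```python
import numpy as np


def build_targets(nk_of_row):
    """Group rows of a sorted stargets_sources-like table by (n,k) index.

    Returns (src, length, nk_index): first row, row count, and combined
    (n,k) index of each target. Assumes nk_of_row is already sorted.
    """
    nk_index, src, length = np.unique(
        nk_of_row, return_index=True, return_counts=True
    )
    return src, length, nk_index


def gather_collision_integral(sc_col, signed_sc_idx, src, length, nk_index, n_nk):
    """Accumulate per-channel contributions into the collision integral.

    Each iteration of the outer loop stands in for one GPU thread: it reads
    its own rows and writes a single output element, so no two iterations
    write to the same location.
    """
    coll = np.zeros(n_nk)
    for t in range(len(nk_index)):
        acc = 0.0
        for row in range(src[t], src[t] + length[t]):
            s = signed_sc_idx[row]
            # Assumed convention: 1-based index, sign selects add or subtract.
            acc += np.sign(s) * sc_col[abs(s) - 1]
        coll[nk_index[t]] = acc
    return coll


# Hypothetical toy data: three scattering channels, four (n,k) elements.
sc_col = np.array([0.5, 1.0, 2.0])              # per-channel contributions
nk_of_row = np.array([0, 0, 1, 2, 2, 3])        # sorted (n,k) index per row
signed_sc_idx = np.array([1, -2, -1, 2, 3, -3])  # signed 1-based channel index

src, length, nk_index = build_targets(nk_of_row)
coll = gather_collision_integral(
    sc_col, signed_sc_idx, src, length, nk_index, n_nk=4
)

# Reference: a scatter-add over rows, the pattern that would require atomics
# if rows were processed by concurrent threads.
reference = np.zeros(4)
np.add.at(
    reference,
    nk_of_row,
    np.sign(signed_sc_idx) * sc_col[np.abs(signed_sc_idx) - 1],
)
print(coll, reference)
```

In this sketch the gather over targets and the scatter-add over rows give the same result. The difference is only in who writes where: the scatter-add lets several rows update one element, while the gather assigns each element to a single writer.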