Efficient GPU Parallelization of Electronic Transport and Nonequilibrium Dynamics from Electron-Phonon Interactions in the Perturbo Code
Shiyu Peng, Donnie Pinkston, Jia Yao, Sergei Kliavinek, Ivan Maliyov, Marco Bernardi
TL;DR
This work addresses the computational bottleneck of first-principles Boltzmann transport calculations with electron-phonon (e-ph) interactions by delivering a GPU-accelerated Perturbo implementation. It uses a novel, fixed-size data structure to organize scattering channels and an accumulation strategy that minimizes atomic operations and host-device transfers. The optimized OpenACC-based GPU code achieves roughly 40× speed-ups over the CPU baseline and exhibits near-linear strong scaling up to tens of GPU nodes, enabling efficient exploration of transport and ultrafast dynamics in complex materials and preparing Perturbo for exascale platforms. The approach is validated on GaAs, graphene, and Si systems (with and without spin-orbit coupling, SOC), with substantial reductions in the active scattering-channel space and detailed performance and memory analyses. The method is broadly applicable to other scattering mechanisms and paves the way for large-scale, high-resolution e-ph physics studies on next-generation HPC systems.
Abstract
The Boltzmann transport equation (BTE) with electron-phonon (e-ph) interactions computed from first principles is widely used to study electronic transport and nonequilibrium dynamics in materials. Calculating the e-ph collision integral is the most important step in the BTE, but it remains computationally costly, even with current MPI+OpenMP parallelization. This cost makes it difficult to study materials with large unit cells and to achieve high resolution in momentum space. Here, we demonstrate acceleration of BTE calculations of electronic transport and ultrafast dynamics using graphics processing units (GPUs). We implement a novel data structure and algorithm, optimized for GPU hardware and developed using OpenACC, to process scattering channels and efficiently compute the collision integral. This approach significantly reduces the overhead for data referencing, movement, and synchronization. Relative to the efficient CPU implementation in the open-source package Perturbo (v2.2.0), used as a baseline, this approach achieves a 40-fold speed-up for both transport and nonequilibrium dynamics on GPU hardware, and achieves nearly linear scaling up to 100 GPUs. The novel data structure can be generalized to other electron interactions and scattering processes. We have released this GPU implementation in the latest public version (v3.0.0) of Perturbo. The new MPI+OpenMP+GPU parallelization enables sweeping studies of e-ph physics and electron dynamics in conventional and quantum materials, and prepares Perturbo for exascale supercomputing platforms.
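To make the idea of a fixed-size channel layout with low-synchronization accumulation more concrete, the following is a minimal Python sketch. It is not taken from Perturbo: the function names, the (target, source, weight) channel tuple, the padding convention, and the simplified linear update are all assumptions chosen for illustration. It shows one generic way to avoid atomic updates, namely grouping channels by the state they update into fixed-width padded rows, so that each output entry is summed by a single owner.

```python
# Hypothetical sketch: names and layout are illustrative assumptions,
# not Perturbo's actual data structure or collision integral.

PAD = -1  # assumed marker for unused slots in a fixed-width row


def build_channel_table(channels, n_states):
    """Group (target, source, weight) channels by target state.

    Returns two n_states x width tables (source indices and weights),
    padded so every row has the same fixed width.
    """
    rows = [[] for _ in range(n_states)]
    for target, source, weight in channels:
        rows[target].append((source, weight))
    width = max((len(row) for row in rows), default=0)
    sources = [[PAD] * width for _ in range(n_states)]
    weights = [[0.0] * width for _ in range(n_states)]
    for t, row in enumerate(rows):
        for j, (s, w) in enumerate(row):
            sources[t][j] = s
            weights[t][j] = w
    return sources, weights


def accumulate_by_target(sources, weights, f):
    """Sum each row independently into its own output entry.

    On a GPU, one thread (or thread group) could own one row, so no two
    workers write to the same output entry and no atomics are needed.
    """
    out = [0.0] * len(sources)
    for t in range(len(sources)):
        acc = 0.0
        for s, w in zip(sources[t], weights[t]):
            if s != PAD:  # skip padded slots
                acc += w * f[s]
        out[t] = acc
    return out


def scatter_add_reference(channels, f, n_states):
    """Reference version: loop over channels and add into shared outputs.

    Parallelizing this loop over channels would require atomic adds,
    because several channels can update the same target entry.
    """
    out = [0.0] * n_states
    for t, s, w in channels:
        out[t] += w * f[s]
    return out
```

A short usage example, with made-up channels and occupations: calling `build_channel_table([(0, 1, 0.5), (0, 2, 0.25), (2, 0, 1.0)], 3)` and then `accumulate_by_target(sources, weights, [1.0, 2.0, 4.0])` gives the same result as `scatter_add_reference` on the same inputs. The trade-off in this sketch is extra memory for padding in exchange for regular, conflict-free access; the actual e-ph collision integral is more involved than this linear update.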
