Large Scale Finite-Temperature Real-time Time Dependent Density Functional Theory Calculation with Hybrid Functional on ARM and GPU Systems
Rongrong Liu, Zhuoqiang Guo, Qiuchen Sha, Tong Zhao, Haibo Li, Wei Hu, Lijun Liu, Guangming Tan, Weile Jia
TL;DR
The paper addresses the prohibitive cost of finite-temperature rt-TDDFT with hybrid functionals by coupling the PT-IM framework with a set of algorithmic and hardware optimizations. It introduces occupation-matrix diagonalization to reduce exchange and density calculations, and ACE to lower the frequency of expensive Fock-exchange evaluations, while ring-based and asynchronous communication plus shared memory mitigate memory and communication bottlenecks. These advances yield substantial per-step speedups on ARM and GPU architectures (up to about 50× in some cases) and enable simulations of up to 3072 atoms, with demonstrated accuracy against conventional RK4 dynamics. The work significantly broadens the practicality of large-scale, finite-temperature rt-TDDFT with hybrid functionals for real materials on modern heterogeneous HPC systems.
Abstract
Ultra-fast electronic phenomena originating from finite temperature, such as nonlinear optical excitation, can be simulated with high fidelity via real-time time-dependent density functional theory (rt-TDDFT) calculations with hybrid functionals. However, previous rt-TDDFT simulations of real materials using the optimal gauge, known as the parallel transport gauge, have been limited to low-temperature systems with band gaps. In this paper, we introduce the parallel transport-implicit midpoint (PT-IM) method, which significantly accelerates finite-temperature rt-TDDFT calculations of real materials with hybrid functionals. We first implement PT-IM with hybrid functionals in our plane-wave code PWDFT and optimize it on both GPU and ARM platforms to build a solid baseline code. Next, we propose a diagonalization method to reduce computation and communication complexity, and then employ the adaptively compressed exchange (ACE) method to reduce how often the most expensive operation, the Fock exchange operator, is evaluated. Finally, we adopt a ring-based method and a shared-memory mechanism to overlap computation with communication and to alleviate memory consumption, respectively. Numerical results show that our optimized code can reach 3072 atoms for rt-TDDFT simulation with a hybrid functional at finite temperature on 192 computing nodes. The time-to-solution for one time step is 429.3 s, which is 41.4 times faster than the baseline.
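To illustrate why diagonalization can reduce the work, the following is a minimal NumPy sketch, not the PWDFT implementation. It assumes the finite-temperature density takes the form rho(r) = sum_ij phi_i(r) sigma_ij conj(phi_j(r)), with sigma a Hermitian occupation matrix. The function names, array shapes, and random test data are hypothetical and chosen only for illustration. Diagonalizing sigma and rotating the orbitals turns the double sum over orbital pairs into a single sum over rotated orbitals.

```python
import numpy as np


def density_direct(phi, sigma):
    """Density from all orbital pairs: sum_ij phi_i(r) sigma_ij conj(phi_j(r)).

    phi   : (n_grid, n_orb) complex array, one orbital per column (assumed layout)
    sigma : (n_orb, n_orb) Hermitian occupation matrix
    """
    return np.einsum("ri,ij,rj->r", phi, sigma, phi.conj()).real


def density_diagonalized(phi, sigma):
    """Same density using sigma = Q diag(d) Q^H and rotated orbitals phi @ Q."""
    occ, q = np.linalg.eigh(sigma)       # real eigenvalues, unitary eigenvectors
    phi_rot = phi @ q                    # rotate orbitals once
    return (np.abs(phi_rot) ** 2) @ occ  # single sum: sum_k d_k |phi_rot_k(r)|^2


# Hypothetical test data: 64 grid points, 6 orbitals.
rng = np.random.default_rng(0)
n_grid, n_orb = 64, 6
phi = rng.standard_normal((n_grid, n_orb)) + 1j * rng.standard_normal((n_grid, n_orb))
a = rng.standard_normal((n_orb, n_orb)) + 1j * rng.standard_normal((n_orb, n_orb))
sigma = 0.5 * (a + a.conj().T)           # Hermitian by construction

rho_pairs = density_direct(phi, sigma)
rho_diag = density_diagonalized(phi, sigma)
```

In this sketch the direct form touches every (i, j) orbital pair, while the diagonalized form needs one rotation and then only n_orb squared-modulus terms. The same rotation idea applies to pairwise exchange-type sums, though the details in the paper may differ.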
