Large Scale Finite-Temperature Real-time Time Dependent Density Functional Theory Calculation with Hybrid Functional on ARM and GPU Systems

Rongrong Liu, Zhuoqiang Guo, Qiuchen Sha, Tong Zhao, Haibo Li, Wei Hu, Lijun Liu, Guangming Tan, Weile Jia

TL;DR

The paper addresses the prohibitive cost of finite-temperature rt-TDDFT with hybrid functionals by coupling the PT-IM framework with a set of algorithmic and hardware optimizations. It introduces occupation-matrix diagonalization to reduce exchange and density calculations, and ACE to lower the frequency of expensive Fock-exchange evaluations, while ring-based and asynchronous communication plus shared memory mitigate memory and communication bottlenecks. These advances yield substantial per-step speedups on ARM and GPU architectures (up to about 50× in some cases) and enable simulations of up to 3072 atoms, with demonstrated accuracy against conventional RK4 dynamics. The work significantly broadens the practicality of large-scale, finite-temperature rt-TDDFT with hybrid functionals for real materials on modern heterogeneous HPC systems.
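
To illustrate why diagonalizing the occupation matrix reduces work, here is a minimal NumPy sketch under stated assumptions. It assumes the finite-temperature density takes the form $\rho(r) = \sum_{ij} \phi_i(r)\,\sigma_{ij}\,\phi_j^*(r)$ with a Hermitian occupation matrix $\sigma$. The orbitals, the matrix, and all variable names are random stand-ins chosen for illustration, not taken from PWDFT.

```python
import numpy as np

rng = np.random.default_rng(0)
n_grid, n_band = 64, 6  # hypothetical sizes, for illustration only

# Stand-in orbitals on a real-space grid; each column is one band.
phi = rng.standard_normal((n_grid, n_band)) + 1j * rng.standard_normal((n_grid, n_band))

# Stand-in Hermitian occupation matrix with unit trace.
a = rng.standard_normal((n_band, n_band)) + 1j * rng.standard_normal((n_band, n_band))
sigma = a @ a.conj().T
sigma /= np.trace(sigma).real

# Direct evaluation: a double sum over band pairs (i, j) at every grid point.
rho_direct = np.einsum("ri,ij,rj->r", phi, sigma, phi.conj()).real

# Diagonalize sigma = U diag(d) U^H and rotate the orbitals once.
d, u = np.linalg.eigh(sigma)
phi_rot = phi @ u

# Only a single sum over the rotated bands remains.
rho_diag = (np.abs(phi_rot) ** 2) @ d

max_diff = np.max(np.abs(rho_direct - rho_diag))
```

After the rotation, the number of terms per grid point drops from the square of the band count to the band count, and the same idea carries over to the orbital pairs that enter the exchange operator.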

Abstract

Ultra-fast electronic phenomena originating from finite temperature, such as nonlinear optical excitation, can be simulated with high fidelity via real-time time dependent density functional theory (rt-TDDFT) calculations with hybrid functionals. However, previous rt-TDDFT simulations of real materials using the optimal gauge, known as the parallel transport gauge, have been limited to low-temperature systems with band gaps. In this paper, we introduce the parallel transport-implicit midpoint (PT-IM) method, which significantly accelerates finite-temperature rt-TDDFT calculations of real materials with hybrid functionals. We first implement PT-IM with hybrid functionals in our plane wave code PWDFT and optimize it on both GPU and ARM platforms to build a solid baseline code. Next, we propose a diagonalization method to reduce computation and communication complexity, and then employ the adaptively compressed exchange (ACE) method to reduce how often the most expensive operator, the Fock exchange operator, is evaluated. Finally, we adopt the ring-based method to overlap computation and communication, and the shared memory mechanism to alleviate memory consumption. Numerical results show that our optimized code can reach 3072 atoms for rt-TDDFT simulation with hybrid functionals at finite temperature on 192 computing nodes; the time-to-solution for one time step is 429.3 s, which is 41.4 times faster than the baseline.
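
The ACE step mentioned in the abstract can be sketched with the standard ACE construction: given $W = V_x\Phi$, factor $-\Phi^H W = LL^H$ and set $\xi = W L^{-H}$, so that $V_{\mathrm{ACE}} = -\xi\xi^H$ reproduces $V_x\Phi$ on the orbitals used to build it. The sketch below is a hedged illustration only. The dense negative-definite matrix `vx` is a hypothetical stand-in for the Fock exchange operator, which a plane wave code would never form explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
n_grid, n_band = 40, 5  # hypothetical sizes, for illustration only

# Stand-in for the exchange operator: a dense Hermitian negative-definite matrix.
b = rng.standard_normal((n_grid, n_grid)) + 1j * rng.standard_normal((n_grid, n_grid))
vx = -(b @ b.conj().T) - np.eye(n_grid)

# Stand-in orbitals (columns are bands).
phi = rng.standard_normal((n_grid, n_band)) + 1j * rng.standard_normal((n_grid, n_band))

# One expensive application of the full operator.
w = vx @ phi

# M = Phi^H W is Hermitian negative definite, so -M has a Cholesky factor L.
m = phi.conj().T @ w
chol = np.linalg.cholesky(-m)  # -M = L L^H, with L lower triangular

# xi = W L^{-H}, computed as (L^{-1} W^H)^H.
xi = np.linalg.solve(chol, w.conj().T).conj().T


def apply_ace(x):
    """Apply the low-rank compressed operator V_ACE = -xi xi^H to x."""
    return -xi @ (xi.conj().T @ x)


# On the orbitals used for the construction, V_ACE matches the full operator.
ace_error = np.max(np.abs(apply_ace(phi) - w))
```

Once `xi` is built, each further application costs two thin matrix products instead of a full exchange evaluation, which is why the compressed operator can be reused across inner iterations before it is refreshed.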

Paper Structure

This paper contains 31 sections, 14 equations, 11 figures, 1 table, 2 algorithms.

Figures (11)

  • Figure 1: The parallel distribution of wavefunction $\Phi$ (left: band-index parallelization; right: grid-point parallelization). Note that MPI_Alltoallv is required to transpose between the two parallelization schemes.
  • Figure 2: Evaluation of the Fock exchange operator. (a) Baseline. (b) Accelerated by diagonalization.
  • Figure 3: Evaluation of $V_x\Phi$. (a) Direct two-electron integral. (b) ACE operator.
  • Figure 4: Propagation of one rt-TDDFT time step using (a) PT-IM and (b) PT-IM-ACE, which uses a double loop to reduce the frequency of Fock exchange operator evaluations.
  • Figure 5: Communication pattern of wavefunctions across 4 processes. (a) Bcast-based method. (b) Ring-based point-to-point pattern. (c) Asynchronous ring-based method. The solid red two-way arrow indicates MPI_Bcast communication, and the solid red one-way arrow indicates point-to-point communication. The dashed red one-way arrow indicates asynchronous point-to-point communication. $\odot$ denotes element-wise multiplication between two wavefunctions.
  • ...and 6 more figures
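
The ring-based pattern described in the Figure 5 caption can be sketched as a single-process simulation. This is a sketch under assumptions, not the paper's implementation: the "ranks" are entries of a Python list, the neighbor exchange is a list rotation standing in for point-to-point MPI calls, and the kernel is a simplified pair product with the Poisson solve omitted. The function names are hypothetical.

```python
import numpy as np


def ring_accumulate(blocks):
    """Simulate a ring exchange: each rank sees every block after len(blocks) steps."""
    p = len(blocks)
    out = [np.zeros_like(b) for b in blocks]
    buf = [b.copy() for b in blocks]  # each rank starts with its own block
    for _ in range(p):
        for rank in range(p):
            visiting, local = buf[rank], blocks[rank]
            for j in range(visiting.shape[1]):
                # Element-wise pair product of a visiting band with all local bands.
                pair = visiting[:, j].conj()[:, None] * local
                # Simplified kernel: a real code would solve a Poisson equation here.
                out[rank] += visiting[:, j][:, None] * pair
        # Pass buffers around the ring: each rank receives from its left neighbor.
        buf = [buf[(rank - 1) % p] for rank in range(p)]
    return out


def reference(blocks):
    """Same result computed as if every rank held all bands (broadcast-style)."""
    all_bands = np.hstack(blocks)
    weight = np.sum(np.abs(all_bands) ** 2, axis=1)
    return [weight[:, None] * b for b in blocks]


rng = np.random.default_rng(2)
n_grid, n_local, n_rank = 32, 3, 4  # 4 ranks, as in the figure; other sizes are arbitrary
blocks = [
    rng.standard_normal((n_grid, n_local)) + 1j * rng.standard_normal((n_grid, n_local))
    for _ in range(n_rank)
]

out_ring = ring_accumulate(blocks)
out_ref = reference(blocks)
```

Each rank only ever holds its own block plus one visiting block, which is the memory advantage over broadcasting. In an MPI code the rotation could be issued with non-blocking sends and receives so that the next block arrives while the current one is being processed.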