Implementation of the multigrid Gaussian-Plane-Wave algorithm with GPU acceleration in PySCF

Rui Li, Xing Zhang, Qiming Sun, Yuanheng Wang, Junjie Yang, Garnet Kin-Lic Chan

Abstract

We introduce a GPU-accelerated multigrid Gaussian-Plane-Wave density fitting (FFTDF) approach for efficient Fock builds and nuclear gradient evaluations within Kohn-Sham density functional theory, as implemented in the GPU4PySCF module of PySCF. Our CUDA kernels employ a grid-based parallelization strategy for contracting Gaussian basis function pairs and achieve up to 80% of the FP64 peak performance on NVIDIA GPUs, with no loss of efficiency for high angular momentum (up to f-shell) functions. Benchmark calculations on molecules and solids with up to 1536 atoms and 20480 basis functions show up to 25x speedup on an H100 GPU relative to the CPU implementation on a 28-core shared memory node. For a 256-water cluster, the ground-state energy and nuclear gradients can be computed in ~30 seconds on a single H100 GPU. This implementation serves as an open-source foundation for many applications, such as ab initio molecular dynamics and high-throughput calculations.

Paper Structure

This paper contains 9 sections, 32 equations, 4 figures, 2 tables, 5 algorithms.

Figures (4)

  • Figure 1: Speedups of GPU4PySCF on NVIDIA H100 and A100 GPUs for a single SCF iteration and nuclear gradient calculation, relative to the corresponding PySCF CPU timings reported in Table \ref{tab:timing}.
  • Figure 2: Speedups of GPU4PySCF on NVIDIA H100 and A100 GPUs, and of CP2K on A100 GPUs and CPUs, for a single Fock build, relative to the corresponding PySCF CPU timings reported in Table \ref{tab:timing_fock}.
  • Figure 3: Fraction of the computational time spent in each subroutine of (a) an SCF cycle and (b) a nuclear gradient calculation for water clusters, benchmarked on H100 GPUs. In the legends, "Hxc" denotes the Hartree and exchange-correlation potential, and "Pseudo" denotes the pseudopotential.
  • Figure 4: FLOP performance of the custom CUDA kernels analyzed using the roofline model benchmarked on the NVIDIA A100 GPU. The solid blue line represents the official peak FP64 FLOP rate of 9.7 TFLOP/s with no bandwidth constraint (horizontal) and the peak FP64 FLOP rate constrained by the peak memory bandwidth of 1.6 TB/s (diagonal). The theoretical arithmetic intensity of 6.1 FLOP/byte marks the boundary between the memory-bound zone and the compute-bound zone for the A100 GPU. The benchmark calculations were performed for a 32-water cluster at the PBE/GTH-cc-pVQZ level of theory.
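
The roofline numbers quoted in the Figure 4 caption can be checked with a few lines of arithmetic. The sketch below is only an illustration of the standard roofline model, not code from GPU4PySCF; the function names `ridge_point` and `attainable_flops` are hypothetical, and the only inputs are the two A100 figures stated in the caption (9.7 TFLOP/s peak FP64 and 1.6 TB/s peak memory bandwidth).

```python
# Minimal roofline-model sketch (illustrative; function names are hypothetical).
# Inputs are the A100 figures quoted in the Figure 4 caption.

A100_PEAK_FP64 = 9.7e12  # peak FP64 rate in FLOP/s
A100_PEAK_BW = 1.6e12    # peak memory bandwidth in byte/s


def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the two roofline segments meet."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline bound: the lower of the compute ceiling and the bandwidth ceiling."""
    return min(peak_flops, intensity * peak_bandwidth)


ridge = ridge_point(A100_PEAK_FP64, A100_PEAK_BW)
print(f"ridge point: {ridge:.1f} FLOP/byte")  # 6.1, matching the caption

# A kernel at 1 FLOP/byte sits on the diagonal (memory-bound) segment,
# while one at 10 FLOP/byte sits on the horizontal (compute-bound) segment.
low = attainable_flops(1.0, A100_PEAK_FP64, A100_PEAK_BW)
high = attainable_flops(10.0, A100_PEAK_FP64, A100_PEAK_BW)
print(f"bound at 1 FLOP/byte:  {low / 1e12:.1f} TFLOP/s")
print(f"bound at 10 FLOP/byte: {high / 1e12:.1f} TFLOP/s")
```

Kernels with an arithmetic intensity below the ridge point are limited by memory bandwidth, and those above it are limited by the FP64 compute ceiling, which is the distinction the caption draws between the two zones.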