Multi-GPU MBE(3)-OSV-MP2 for Performant Large-Scale ab initio Calculations

Qiujiang Liang; Jun Yang

Abstract

The acceleration of orbital-invariant local correlation methods on graphics processing units (GPUs) has remained largely unexplored because of their substantial algorithmic complexity. The runtime efficiency of GPU-implemented local correlation theories can be significantly constrained by the degree to which the orbital localization procedure can be parallelized, by the iterative solution of the local wave function, and by the adaptation of CUDA kernels to inherently local or sparse operations. Using second-order Møller-Plesset perturbation theory (MP2), we present a multi-GPU implementation for large-scale third-order many-body expansion orbital-specific virtual MP2 (MBE(3)-OSV-MP2) energy calculations. Our algorithms and implementation target high GPU utilization and parallelism in several parts of the local MP2 computation, including Jacobi-Pipek-Mezey localization, randomized OSV generation, direct MP2 integral regeneration, and the adaptation of CUDA kernels to local operations. The GPU-based MBE(3)-OSV-MP2 energy computation achieves $O(N^{1.9})$ scaling and 84\% parallel efficiency on up to 24 GPUs distributed across multiple nodes. The present implementation delivers a 40-fold wall-time speedup over canonical RI-MP2 and a 10-fold speedup over the CPU-based MBE(3)-OSV-MP2 for (H$_2$O)$_{128}$/cc-pVDZ and (H$_2$O)$_{190}$/cc-pVDZ, respectively. A large-scale computation on a 784-atom insulin peptide yields the full MBE(3)-OSV-MP2 energies in 24 minutes with cc-pVDZ (7571 basis functions) and 6.4 hours with cc-pVTZ (17448 basis functions) on 8 NVIDIA A800 GPUs. Our work opens up new possibilities for performing fast GPU-based local correlation calculations on real-life macromolecules.
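
An empirical scaling exponent such as the one quoted in the abstract is commonly estimated as the slope of a least-squares fit of log(wall time) against log(system size). The sketch below illustrates that approach only; it is not necessarily the fitting procedure used by the authors. The function name `fit_scaling_exponent` is hypothetical, and the timings are synthetic values generated from an exact power law, not measurements from the paper.

```python
import math


def fit_scaling_exponent(sizes, times):
    """Return the least-squares slope of log(time) versus log(size).

    For timings that follow t = c * N**p, the slope equals p.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


# Synthetic data (assumption, for illustration only): t = 1e-4 * N**1.9
sizes = [1000, 2000, 4000, 8000]
times = [1e-4 * n ** 1.9 for n in sizes]

exponent = fit_scaling_exponent(sizes, times)
print(f"fitted exponent: {exponent:.3f}")
```

With real timings the fitted slope depends on the range of system sizes included, so the exponent should be reported together with that range.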

Paper Structure (14 sections, 27 equations, 5 figures, 2 tables, 7 algorithms)

Introduction
MBE(3)-OSV-MP2 Method
Implementation for Multi-GPU Computing
Direct Density Fitting Integrals
Occupied Orbital Localization
Randomized OSV generation
OSV Overlap and Fock Computation
OSV Exchange Integrals
MBE(3)-OSV-MP2 Residual Equations
Results
GPU Parallel Scalability with Molecular Sizes
Speedup Scalability with GPU numbers
Large Molecules
Conclusions

Figures (5)

Figure 1: a. Multi-GPU parallel architecture. b. Coalesced GPU memory access. c. GPU parallel scheme for MBE(3)-OSV-MP2.
Figure 2: Total wall time (seconds) as a function of the number of atomic orbitals for MBE(3)-OSV-MP2 and its individual computational components, using one NVIDIA A800 (80 GB) GPU. All timings of polyglycines $(\text{Gly})_{n}$ ($n=4,8,\ldots,40$) were obtained using def2-TZVP/def2-TZVP-RIFit basis sets.
Figure 3: Strong scaling performance of MBE(3)-OSV-MP2 without localization with respect to the number of GPUs (A800/80 GB). Each computing node is equipped with 8 GPUs. The parallel efficiency is the ratio of the actual speedup to the ideal speedup, expressed as a percentage. Calculations were performed using cc-pVDZ/cc-pVDZ-RIFit basis sets for both (H$_2$O)$_{100}$ and (H$_2$O)$_{300}$ clusters.
Figure 4: Wall time (seconds) comparison between GPU-accelerated MBE(3)-OSV-MP2 and ByteQC's RI-MP2 [guo2025byteqc], as well as the computational time (seconds) of EXESS's RI-MP2 [snowdon2024efficient]. Dashed curves indicate the speedup of MBE(3)-OSV-MP2 relative to RI-MP2. RI-MP2 calculations were performed on an A100 (80 GB) GPU [guo2025byteqc, snowdon2024efficient], while MBE(3)-OSV-MP2 calculations were carried out on an A800 (80 GB) GPU. For MBE(3)-OSV-MP2 calculations, the structures of water clusters were optimized using the QM-polarizable water model ChargeNN [liang2025polarizable].
Figure 5: Wall time (seconds) for (a) localization and (b) subsequent MBE(3)-OSV-MP2 processes, measured on a single NVIDIA A800 (80 GB) GPU and 64 Intel Xeon Platinum 8360Y (2.40 GHz) CPU cores. The coordinates of C$_{60}$@catcher and $(\text{H}_{2}\text{O})_{190}$ were taken from Refs. [sure2015comprehensive] and [watergeo], respectively.
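
The parallel efficiency used in the strong-scaling figure is the actual speedup divided by the ideal speedup, expressed as a percentage. A minimal sketch of that arithmetic follows. The function name `parallel_efficiency`, its parameters, and the example timings are hypothetical and chosen only for illustration; they are not data from the paper.

```python
def parallel_efficiency(t_ref, t_n, n_gpus, n_ref=1):
    """Parallel efficiency in percent.

    t_ref  : wall time on the reference run with n_ref GPUs
    t_n    : wall time on n_gpus GPUs for the same problem (strong scaling)
    Actual speedup is t_ref / t_n; ideal speedup is n_gpus / n_ref.
    """
    actual_speedup = t_ref / t_n
    ideal_speedup = n_gpus / n_ref
    return 100.0 * actual_speedup / ideal_speedup


# Made-up timings (assumption): 100 s on 1 GPU, 25 s on 8 GPUs.
# Actual speedup is 4x against an ideal 8x, i.e. 50% efficiency.
efficiency = parallel_efficiency(t_ref=100.0, t_n=25.0, n_gpus=8)
print(f"parallel efficiency: {efficiency:.1f}%")
```

The `n_ref` argument allows a multi-GPU baseline, which is useful when the problem does not fit in the memory of a single device.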
