3-center and 4-center 2-particle Gaussian AO integrals on modern accelerated processors

Andrey Asadchev, Edward F. Valeev

TL;DR

This work extends the matrix-form McMurchie-Davidson (MD) approach to efficient GPU evaluation of 3-center and 4-center 2-electron Gaussian AO integrals across low to high angular momenta (benchmarked for $l\leq 6$). It introduces three MD variants (V0, V1, V2) tailored to different memory and compute regimes. Combined with data-layout optimizations and batched GEMMs, these variants sustain between 25% and 70% of the theoretical hardware peak in double precision, as documented by detailed performance benchmarks. A preliminary exchange-operator implementation demonstrates practical applicability to large systems (more than 20,000 AOs) and is integrated into the open-source LibintX library (LGPL3, C++17/CUDA). The work highlights the viability of a matrix MD-based engine for modern accelerators and outlines future enhancements, including CPU SIMD adaptations, complete exchange machinery, and derivatives for advanced electronic-structure methods.

Abstract

We report an implementation of the McMurchie-Davidson (MD) algorithm for 3-center and 4-center 2-particle integrals over Gaussian atomic orbitals (AOs) with low and high angular momenta $l$ and varying degrees of contraction for graphics processing units (GPUs). This work builds upon our recent implementation of a matrix form of the MD algorithm that is efficient for GPU evaluation of 4-center 2-particle integrals over Gaussian AOs of high angular momenta ($l\geq 4$) [$\mathit{J. Phys. Chem. A}\ \mathbf{127}$, 10889 (2023)]. The use of unconventional data layouts and three variants of the MD algorithm allows integrals to be evaluated in double precision with sustained performance between 25% and 70% of the theoretical hardware peak. The performance assessment includes integrals over AOs with $l\leq 6$ (higher $l$ is supported). A preliminary implementation of the Hartree-Fock exchange operator is presented and assessed for computations with basis sets of up to quadruple-zeta quality and more than 20,000 AOs. The corresponding C++ code is part of the experimental open-source $\mathtt{LibintX}$ library available at $\mathbf{github.com:ValeevGroup/LibintX}$.
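
To make the phrase "matrix form of the MD algorithm" concrete, the textbook MD expansion of a primitive 4-center integral can be sketched as follows. The notation here is an assumption taken from the standard MD literature and is not necessarily the notation used in the paper: $E^{ab}_{\tilde p}$ are Hermite expansion coefficients of the bra product $ab$, $\tilde p=(t,u,v)$ is a Hermite multi-index, $R_{\tilde p+\tilde q}$ are Hermite Coulomb integrals, and exponent-dependent prefactors are omitted.

$$
(ab|cd) \;=\; \sum_{\tilde p}\sum_{\tilde q} E^{ab}_{\tilde p}\,[\tilde p|\tilde q]\,E^{cd}_{\tilde q},
\qquad
[\tilde p|\tilde q] \;\propto\; (-1)^{|\tilde q|}\,R_{\tilde p+\tilde q}
$$

Collecting the coefficients into matrices $\mathbf{E}_{ab}$ and $\mathbf{E}_{cd}$ and the Hermite integrals into a matrix $\mathbf{H}$ turns the double sum into a product of the form $\mathbf{E}_{ab}^{\mathsf T}\,\mathbf{H}\,\mathbf{E}_{cd}$, which is the kind of operation that maps onto batched GEMMs. The number of Hermite indices with $|\tilde p|\leq L$ is $(L+1)(L+2)(L+3)/6$, so for $l_a=l_b=6$ (that is, $L=12$) each side of $\mathbf{H}$ has 455 entries, which illustrates why high angular momenta favor a matrix formulation.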

Paper Structure (10 sections, 20 equations, 7 tables)