Implementation of McMurchie-Davidson algorithm for Gaussian AO integrals suited for SIMD processors
Andrey Asadchev, Edward F. Valeev
TL;DR
The paper addresses the efficient evaluation of Gaussian AO integrals on SIMD-enabled CPUs by adapting the McMurchie-Davidson scheme to process batches of shellsets at a time, which keeps the vector units well utilized. It combines algorithmic improvements (external Coulomb transforms, early contraction, and 3-center optimizations) with carefully vectorized primitives and Boys function handling to reach up to 50% of the theoretical peak FP64 performance on AVX2, AVX512, and NEON, without relying on code generation. The authors demonstrate substantial speedups over the traditional Libint Obara-Saika implementation across 1-, 3-, and 4-center 2-particle integrals, and show favorable comparisons against Simint for higher angular momenta, all within the open-source LibintX framework. These results indicate a practical, portable CPU kernel design that follows the same batched approach as the authors' GPU integral engine.
Abstract
We report an implementation of the McMurchie-Davidson (MD) evaluation scheme for 1- and 2-particle Gaussian AO integrals designed for processors with Single Instruction Multiple Data (SIMD) instruction sets. As in our recent MD implementation for graphics processing units (GPUs) [J. Chem. Phys. 160, 244109 (2024)], variable-sized batches of shellsets of integrals are evaluated at a time. By optimizing for floating-point instruction throughput rather than minimizing the number of operations, this approach achieves up to 50% of the theoretical hardware peak FP64 performance on many common SIMD-equipped platforms (AVX2, AVX512, NEON), which translates to speedups of up to 30-fold over the state-of-the-art one-shellset-at-a-time implementation of Obara-Saika-type schemes in Libint for a variety of primitive and contracted integrals. As in our previous work, we rely on standard C++, including the std::simd standard library feature to be included in the 2026 ISO C++ standard, without any explicit code generation, to keep the code base small and portable. The implementation is part of the open-source LibintX library, freely available at https://github.com/ValeevGroup/libintx.
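The batching idea can be illustrated with a minimal sketch. It is not LibintX code and does not implement the McMurchie-Davidson recurrences: it assumes the simplest possible integral, the overlap of two unnormalized s-type primitives, and uses a plain fixed-length loop in place of std::simd so that it compiles with any C++17 compiler. The names `kLanes`, `PrimitivePairBatch`, and `overlap_ss` are hypothetical. The point is the data layout: each array holds the same quantity for several independent primitive pairs, so one pass over the lanes evaluates all of them with identical arithmetic.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical lane count: 4 doubles fill one 256-bit (AVX2) register.
constexpr std::size_t kLanes = 4;

// Structure-of-arrays batch: element i of every array belongs to pair i.
struct PrimitivePairBatch {
  std::array<double, kLanes> a;   // exponent of the first Gaussian
  std::array<double, kLanes> b;   // exponent of the second Gaussian
  std::array<double, kLanes> r2;  // squared distance between the two centers
};

// Overlap of two unnormalized s-type primitives, evaluated for all lanes:
//   S = (pi / (a + b))^(3/2) * exp(-a * b / (a + b) * R^2)
inline std::array<double, kLanes> overlap_ss(const PrimitivePairBatch& batch) {
  const double pi = std::acos(-1.0);
  std::array<double, kLanes> s{};
  for (std::size_t i = 0; i < kLanes; ++i) {
    // Exponents must be positive; the check is compiled out under NDEBUG.
    assert(batch.a[i] > 0.0 && batch.b[i] > 0.0);
    const double p = batch.a[i] + batch.b[i];        // total exponent
    const double mu = batch.a[i] * batch.b[i] / p;   // reduced exponent
    s[i] = std::pow(pi / p, 1.5) * std::exp(-mu * batch.r2[i]);
  }
  return s;
}
```

Because every lane executes the same operations with no data-dependent branches, a loop of this shape maps directly onto vector instructions; a production kernel would presumably replace the arrays with a SIMD vector type and the scalar math calls with their vector counterparts.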
