
$Δ$-ML Ensembles for Selecting Quantum Chemistry Methods to Compute Intermolecular Interactions

Austin M. Wallace, C. David Sherrill, Giri P. Krishnan

TL;DR

Problem: selecting accurate yet affordable quantum chemistry methods for intermolecular interactions. Approach: a $Δ$-ML ensemble trained on AP-Net2 embeddings to predict the error $ΔE_{\rm pred}$ between any level of theory and the reference $E_{\rm IE,ref}$ (CCSD(T)/CBS/CP), supplemented by compute-time estimators to prune expensive options. Contributions: demonstration on BFDB-Ext with 80x80 level mappings achieving MAE $<0.1$ kcal/mol, with concrete corrections such as HF/aug-cc-pVDZ/CP from $2.89$ to $0.08$ kcal/mol and MP2/aug-cc-pVQZ/CP from $0.21$ to $0.02$ kcal/mol, and dendrograms showing alignment with theoretical hierarchies; time-based filtering enables practical large-scale screening. Significance: enables data-driven, scalable selection of levels of theory for screening in materials and drug discovery.

Abstract

Ab initio quantum chemical methods for accurately computing interactions between molecules have a wide range of applications but are often computationally expensive. Because method performance varies, selecting an appropriate method based on accuracy and computational cost remains a significant challenge. In this work, we propose a framework based on an ensemble of $Δ$-ML models trained on features extracted from a pre-trained atom-pairwise neural network to predict the error of each method relative to all other methods, including the "gold standard" coupled cluster with single, double, and perturbative triple excitations at the estimated complete basis set limit [CCSD(T)/CBS]. Our proposed approach provides error estimates across various levels of theory and identifies the most computationally efficient approach for a given error range, utilizing only a subset of the dataset. Further, this approach allows comparison between various theories. We demonstrate the effectiveness of our approach using an extended BioFragment dataset, which includes the interaction energies for common biomolecular fragments and small organic dimers. Our results show that the proposed framework achieves mean absolute errors below 0.1 kcal/mol regardless of the given method. Furthermore, by analyzing all-to-all $Δ$-ML models for the levels of theory present in the dataset, we identify method groupings that align with theoretical hypotheses, providing evidence that $Δ$-ML models can easily learn corrections from any level of theory to any other level of theory.
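
The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` class, `select_method`, `delta_correct`, and the sign convention (the predicted correction is added to the method's interaction energy) are assumptions introduced here. The two error values are the uncorrected MAEs quoted in the TL;DR; the timings are arbitrary placeholders, not measured data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One level of theory with hypothetical error and cost estimates."""
    name: str              # label such as "MP2/aug-cc-pVQZ/CP"
    pred_abs_error: float  # predicted |error| vs. the reference, kcal/mol
    est_time: float        # estimated compute time, arbitrary units


def delta_correct(e_method: float, delta_pred: float) -> float:
    """Apply a Delta-ML correction.

    Assumed convention: delta_pred approximates E_ref - E_method,
    so the corrected energy is E_method + delta_pred.
    """
    return e_method + delta_pred


def select_method(
    candidates: List[Candidate],
    max_error: float,
    max_time: Optional[float] = None,
) -> Optional[Candidate]:
    """Return the cheapest candidate whose predicted error is acceptable.

    Candidates slower than max_time (if given) are pruned first,
    mirroring the time-based filtering mentioned in the TL;DR.
    """
    pool = [c for c in candidates if c.pred_abs_error <= max_error]
    if max_time is not None:
        pool = [c for c in pool if c.est_time <= max_time]
    if not pool:
        return None
    return min(pool, key=lambda c: c.est_time)


# Errors are the uncorrected MAEs from the TL;DR; times are placeholders.
candidates = [
    Candidate("HF/aug-cc-pVDZ/CP", pred_abs_error=2.89, est_time=1.0),
    Candidate("MP2/aug-cc-pVQZ/CP", pred_abs_error=0.21, est_time=50.0),
]

choice = select_method(candidates, max_error=0.5)
print(choice.name if choice else "no method satisfies the tolerance")
```

Returning `None` when no candidate qualifies keeps the "nothing is accurate enough" case explicit, so a caller can loosen the tolerance or fall back to the reference method.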

Paper Structure

This paper contains 7 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the methodology: the BFDBExt dataset is used to train 80x80 $\Delta$AP-Net2 models that predict from any level of theory in the dataset to any other level of theory.
  • Figure 2: (a) BFDBExt dataset test error distributions for select levels of theory with respect to an estimated CCSD(T)/CBS/CP reference. The black horizontal line represents the mean error and the red horizontal lines represent the 5th and 95th percentiles. The uncorrected level of theory IE errors are in blue, while the $\Delta$AP-Net2 plus level of theory IE errors are in green. (b) Dendrogram of $\Delta$AP-Net2 model-predicted error estimates for select methods, ordered by MAE. Note that the clusters of methods are nearly identical to those in the all-to-all M1 to M2 dendrogram in the SI, meaning that the models are accurately predicting any M1 to M2. All levels of theory here use CP.
  • Figure 3: BFDBExt dataset train error distributions for select levels of theory with respect to an estimated CCSD(T)/CBS/CP reference. The black horizontal line represents the mean error and the red horizontal lines represent the 5th and 95th percentiles. The uncorrected level of theory IE errors are in blue, while the $\Delta$AP-Net2 plus level of theory IE errors are in green.
  • Figure 4: Dendrogram of all-to-all $\Delta$AP-Net2 model-predicted error estimates, ordered by MAE. Note that the clusters of methods are nearly identical to those in the all-to-all M1 to M2 dendrogram in the SI, meaning that the models are accurately predicting any M1 to M2.
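
The dendrograms in Figures 2(b) and 4 group methods by how similar their errors are. The captions do not state which clustering procedure was used, so the sketch below assumes plain average-linkage agglomerative clustering on a symmetric method-to-method MAE matrix. The function name `average_linkage` and the 4x4 toy matrix are hypothetical and are not taken from the paper's data.

```python
from typing import List, Sequence, Tuple

Cluster = Tuple[int, ...]
Merge = Tuple[Cluster, Cluster, float]


def average_linkage(dist: Sequence[Sequence[float]]) -> List[Merge]:
    """Agglomerative clustering with average linkage.

    dist[i][j] is a symmetric dissimilarity between methods i and j
    (here: a hypothetical MAE between two levels of theory).
    Returns the merges in order as (cluster_a, cluster_b, height);
    this list is what a dendrogram draws.
    """
    clusters: List[Cluster] = [(i,) for i in range(len(dist))]
    merges: List[Merge] = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                d = sum(dist[p][q] for p in a for q in b) / (len(a) * len(b))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        a, b = clusters[i], clusters[j]
        merges.append((a, b, d))
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(a + b)
    return merges


# Toy matrix (kcal/mol): methods 0 and 1 are similar, as are 2 and 3.
toy_mae = [
    [0.0, 0.1, 2.0, 2.1],
    [0.1, 0.0, 1.9, 2.0],
    [2.0, 1.9, 0.0, 0.2],
    [2.1, 2.0, 0.2, 0.0],
]

for left, right, height in average_linkage(toy_mae):
    print(left, right, round(height, 3))
```

In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would typically replace the hand-written loop; the pure-Python version is used here only to keep the example dependency-free.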