Benchmarking Data Efficiency in $\Delta$-ML and Multifidelity Models for Quantum Chemistry
Vivin Vinod, Peter Zaspel
TL;DR
This work benchmarks the training-data generation cost of $\Delta$-ML, MFML, and o-MFML, and introduces MF$\Delta$ML, on the QeMFi QC dataset. The models predict ground state energies ($E_{\text{gs}}$), vertical excitation energies ($E_{(1)}$, $E_{(2)}$), and the magnitude of the electronic contribution to the molecular dipole moment ($|\boldsymbol{\mu}_e|$) across five fidelities. Multifidelity approaches are generally more data efficient than pure $\Delta$-ML: MFML and MF$\Delta$ML perform best when many predictions are required, and MF$\Delta$ML also holds an advantage when only a few evaluations are needed. The study also clarifies the cost structure: $\Delta$-ML incurs baseline QC costs, MFML avoids those costs by predicting the baseline, and MF$\Delta$ML combines $\Delta$-ML with MFML to further improve efficiency. Overall, the results offer practical guidance on choosing among multifidelity and $\Delta$-ML strategies to minimize training data cost at a target accuracy for QC properties.
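The cost structure summarized above can be made concrete with a small bookkeeping sketch. This is an illustration under stated assumptions, not the paper's cost model: the function names, the per-calculation costs, and the sample counts are all hypothetical, and the cost of evaluating the ML models themselves is neglected. It assumes that $\Delta$-ML needs both a target and a baseline calculation for every training sample plus one baseline calculation for every prediction, and that MFML pays only for its training samples at each fidelity.

```python
def single_fidelity_cost(n_train, c_target):
    """QC cost of a single fidelity model: target calculations for training only."""
    return n_train * c_target


def delta_ml_cost(n_train, n_predict, c_target, c_baseline):
    """QC cost of Delta-ML under the assumptions stated in the lead-in.

    Each training sample needs a target and a baseline calculation,
    and each prediction needs one further baseline calculation.
    """
    return n_train * (c_target + c_baseline) + n_predict * c_baseline


def mfml_cost(n_train_per_fidelity, c_per_fidelity):
    """QC cost of MFML: training samples at every fidelity, no QC cost at prediction."""
    return sum(n * c for n, c in zip(n_train_per_fidelity, c_per_fidelity))


# Hypothetical per-calculation costs in arbitrary time units.
c_target, c_baseline = 10.0, 1.0

# The Delta-ML cost grows with the number of predictions; the MFML cost does not.
for n_predict in (10, 10_000):
    print(n_predict,
          delta_ml_cost(100, n_predict, c_target, c_baseline),
          mfml_cost([100, 400], [c_target, c_baseline]))
```

With numbers like these, the baseline term dominates the $\Delta$-ML cost once the number of predictions becomes large, which is the regime in which the summary above reports multifidelity methods to be more efficient.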
Abstract
The development of machine learning (ML) methods has made quantum chemistry (QC) calculations more accessible by reducing the compute cost incurred in conventional QC methods. The cost burden has since shifted to the overhead of generating training data. Efforts to reduce this overhead led to the development of $\Delta$-ML and multifidelity machine learning methods, which use data at more than one QC level of accuracy, or fidelity. This work compares the data costs associated with $\Delta$-ML, multifidelity machine learning (MFML), and optimized MFML (o-MFML) against a newly introduced Multifidelity $\Delta$-Machine Learning (MF$\Delta$ML) method for the prediction of ground state energies, vertical excitation energies, and the magnitude of the electronic contribution to molecular dipole moments from the multifidelity benchmark dataset QeMFi. The assessment is based on the training data generation cost of each model and is compared with the single fidelity kernel ridge regression (KRR) case. The results indicate that multifidelity methods surpass the standard $\Delta$-ML approaches when a large number of predictions is required. For applications that require only a few evaluations of the ML model, the $\Delta$-ML method might be favored, but the MF$\Delta$ML method is shown to be more efficient.
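To illustrate the difference between single fidelity KRR and $\Delta$-ML mentioned above, the following is a minimal sketch on synthetic one-dimensional data. It is not the authors' implementation and does not use QeMFi: the `baseline` and `target` functions are hypothetical stand-ins for a cheap and an expensive QC fidelity, and the Gaussian kernel, its width `sigma`, and the regularization `lam` are assumed choices. The $\Delta$-ML model is trained on the difference between target and baseline, so every prediction adds a baseline evaluation to the learned correction.

```python
import numpy as np


def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))


def krr_fit(X, y, sigma=1.0, lam=1e-6):
    """Solve (K + lam * I) alpha = y for the KRR coefficients."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)


def krr_predict(X_train, alpha, X_new, sigma=1.0):
    """Evaluate the KRR model at new inputs."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha


def baseline(X):
    """Hypothetical cheap fidelity (stand-in for a low-cost QC method)."""
    return np.sin(3.0 * X[:, 0])


def target(X):
    """Hypothetical expensive fidelity (stand-in for a high-cost QC method)."""
    return np.sin(3.0 * X[:, 0]) + 0.1 * X[:, 0]**2


rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(20, 1))
X_test = rng.uniform(-1.0, 1.0, size=(200, 1))

# Single fidelity KRR: learn the target property directly.
alpha_direct = krr_fit(X_train, target(X_train))
y_direct = krr_predict(X_train, alpha_direct, X_test)

# Delta-ML: learn the target-minus-baseline difference, then add the
# baseline back. The baseline must be computed for every test point.
alpha_delta = krr_fit(X_train, target(X_train) - baseline(X_train))
y_delta = baseline(X_test) + krr_predict(X_train, alpha_delta, X_test)

mae_direct = np.mean(np.abs(y_direct - target(X_test)))
mae_delta = np.mean(np.abs(y_delta - target(X_test)))
print(f"MAE single fidelity KRR: {mae_direct:.2e}")
print(f"MAE Delta-ML:            {mae_delta:.2e}")
```

The call to `baseline(X_test)` in the $\Delta$-ML prediction is the source of the per-prediction QC cost: in a real application it would be a QC calculation at the lower fidelity for every new geometry.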
