Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies
Vivin Vinod, Peter Zaspel
TL;DR
This work addresses the data-hierarchy challenge in multifidelity machine learning (MFML) for quantum chemistry excitation energies by systematically varying the fidelity-sample scaling factor $\gamma$ and introducing time-cost informed scaling factors $\theta$, along with a novel error metric, the MFML error contours, and the $\Gamma$-curve. Using the QeMFi benchmark and a Coulomb-matrix representation with Kernel Ridge Regression, the authors construct MFML and o-MFML models across five fidelities and study how sample allocation across fidelities affects prediction accuracy and training cost. Key findings show that high accuracy can be achieved with very few high-fidelity samples if a larger number of cheaper-fidelity samples is used, and that the $\Gamma$-curve can yield favorable time–error trade-offs, though fixed $\gamma$ scaling frequently remains robust. The study also highlights the importance of data distribution across fidelities (e.g., STO-3G) and reports limited transferability to markedly different molecular sets (QUESTDB), motivating future work on broader functional coverage and optimization of fidelity sampling strategies. Overall, the paper provides a data-hierarchy framework that enables cost-efficient multifidelity predictions of excitation energies, with practical implications for scalable QC workflows.
Abstract
Recent progress in machine learning (ML) has made high-accuracy quantum chemistry (QC) calculations more accessible. Of particular interest are multifidelity machine learning (MFML) methods, in which training data of differing accuracies, or fidelities, are used. These methods usually employ a fixed scaling factor, $\gamma$, to relate the number of training samples across different fidelities, which reflects the cost and assumed sparsity of the data. This study investigates the impact of modifying $\gamma$ on model efficiency and accuracy for the prediction of vertical excitation energies using the QeMFi benchmark dataset. Further, this work introduces QC compute time informed scaling factors, denoted as $\theta$, that vary based on QC compute times at different fidelities. A novel error metric, error contours of MFML, is proposed to provide a comprehensive view of model error contributions from each fidelity. The results indicate that high model accuracy can be achieved with just 2 training samples at the target fidelity when a larger number of samples from lower fidelities is used. This is further illustrated through a novel concept, the $\Gamma$-curve, which compares model error against the time-cost of generating training samples, demonstrating that multifidelity models can achieve high accuracy while minimizing training data costs.
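
The role of the scaling factor can be illustrated with a minimal sketch. It assumes, as a reading of the abstract rather than a statement of the authors' implementation, that each step down in fidelity multiplies the number of training samples by $\gamma$. The function names and the per-sample compute times are hypothetical and chosen only for illustration; they are not QeMFi timings.

```python
def samples_per_fidelity(n_target, gamma, n_fidelities=5):
    """Training-set sizes, ordered from cheapest fidelity to target fidelity.

    Assumption: sizes grow geometrically, so the fidelity k steps below
    the target uses n_target * gamma**k samples.
    """
    return [n_target * gamma ** (n_fidelities - 1 - f) for f in range(n_fidelities)]


def training_cost(sizes, cost_per_sample):
    """Total time-cost of generating the training data across all fidelities."""
    return sum(n * c for n, c in zip(sizes, cost_per_sample))


# Hypothetical per-sample compute times (arbitrary units), cheapest first.
HYPOTHETICAL_COSTS = [1.0, 2.0, 4.0, 8.0, 16.0]

# Two target-fidelity samples, as in the abstract, under two scaling factors.
sizes_gamma2 = samples_per_fidelity(n_target=2, gamma=2)
sizes_gamma4 = samples_per_fidelity(n_target=2, gamma=4)

cost_gamma2 = training_cost(sizes_gamma2, HYPOTHETICAL_COSTS)
cost_gamma4 = training_cost(sizes_gamma4, HYPOTHETICAL_COSTS)

print(sizes_gamma2, cost_gamma2)
print(sizes_gamma4, cost_gamma4)
```

Pairing such a time-cost with the measured model error for each choice of $\gamma$ would give one point of a time-cost versus error comparison in the spirit of the $\Gamma$-curve; the error values themselves have to come from trained models and are not computed here.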
