Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies

Vivin Vinod, Peter Zaspel

TL;DR

This work addresses the data-hierarchy challenge in multifidelity machine learning for quantum chemistry excitation energies by systematically varying the fidelity-sample scaling factor $\gamma$ and introducing time-cost informed scaling factors $\theta$, along with a novel MFML error-contour metric and the $\Gamma$-curve. Using the QeMFi benchmark and a Coulomb-matrix representation with Kernel Ridge Regression, the authors construct MFML and o-MFML models across five fidelities and study how sample allocation across fidelities affects prediction accuracy and training cost. Key findings show that high accuracy can be achieved with very few high-fidelity samples if a larger number of cheaper-fidelity samples are used, and that the $\Gamma$-curve can yield favorable time–error trade-offs, though fixed $\gamma$ scaling frequently remains robust. The study also highlights the importance of data distribution across fidelities (e.g., STO3G) and reports limited transferability to markedly different molecular sets (QUESTDB), motivating future work on broader functional coverage and optimization of fidelity sampling strategies. Overall, the paper provides a data-hierarchy framework that enables cost-efficient multifidelity predictions of excitation energies, with practical implications for scalable QC workflows.

Abstract

Recent progress in machine learning (ML) has made high-accuracy quantum chemistry (QC) calculations more accessible. Of particular interest are multifidelity machine learning (MFML) methods where training data from differing accuracies or fidelities are used. These methods usually employ a fixed scaling factor, $\gamma$, to relate the number of training samples across different fidelities, which reflects the cost and assumed sparsity of the data. This study investigates the impact of modifying $\gamma$ on model efficiency and accuracy for the prediction of vertical excitation energies using the QeMFi benchmark dataset. Further, this work introduces QC compute time informed scaling factors, denoted as $\theta$, that vary based on QC compute times at different fidelities. A novel error metric, error contours of MFML, is proposed to provide a comprehensive view of model error contributions from each fidelity. The results indicate that high model accuracy can be achieved with just 2 training samples at the target fidelity when a larger number of samples from lower fidelities are used. This is further illustrated through a novel concept, the $\Gamma$-curve, which compares model error against the time-cost of generating training samples, demonstrating that multifidelity models can achieve high accuracy while minimizing training data costs.
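To make the role of the fixed scaling factor concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes geometric scaling, that is, each step down in fidelity multiplies the number of training samples by $\gamma$. The function names and the per-sample compute times are hypothetical placeholders chosen for illustration and are not values from the paper.

```python
def samples_per_fidelity(n_target, n_fidelities, gamma):
    """Return training-set sizes ordered from cheapest to target fidelity.

    Assumption (not taken from the paper's code): geometric scaling, so
    each step down in fidelity uses gamma times more samples.
    """
    return [n_target * gamma ** (n_fidelities - 1 - f) for f in range(n_fidelities)]


def total_time_cost(counts, seconds_per_sample):
    """Total time spent generating training data across all fidelities."""
    return sum(n * t for n, t in zip(counts, seconds_per_sample))


# Hypothetical per-sample QC compute times in seconds, cheapest to costliest.
times = [1.0, 2.0, 4.0, 8.0, 16.0]

for gamma in (2, 4):
    # 2 samples at the target fidelity, five fidelities in total.
    counts = samples_per_fidelity(2, 5, gamma)
    print(gamma, counts, total_time_cost(counts, times))
```

With 2 samples at the target fidelity and five fidelities, $\gamma = 2$ gives 32 samples at the cheapest fidelity and $\gamma = 4$ gives 512. Under this assumption, a larger $\gamma$ moves most of the training-data cost to the cheaper fidelities, which is the quantity that a time-cost versus error comparison has to account for.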

Paper Structure

This paper contains 15 sections, 11 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: A hypothetical comparison of training data used across fidelities for the different kinds of scaling factors used in this work. a) The multifidelity training data structure used in MFML with a small fixed scaling factor ($\gamma$). b) The multifidelity training data structure for a large fixed scaling factor ($\gamma$), which results in a larger number of training samples being used at the cheaper fidelities. c) The structure of multifidelity training data used for scaling factors that are decided based on the QC time cost, explained in the section on scaling factors as $\theta_f^F$ and $\theta_{f-1}^f$. d) Comparison of training data structure evolution for conventional MFML and the $\Gamma$-curve, introduced in the section on the $\Gamma$-curve. Notice how the number of training samples used at the target (the costliest) fidelity remains the same across the data structure for the $\Gamma$-curve, while it increases for the conventional MFML method.
  • Figure 2: Multifidelity learning curves for the prediction of excitation energies taken from the QeMFi dataset. The top row corresponds to the MFML models while the bottom row is for the o-MFML models. Different fixed scaling factors are used to scale the data across each fidelity in the multifidelity models, as explained in the section on scaling factors. The scaling factors are reported at the top of each column.
  • Figure 3: MFML and o-MFML learning curves for scaling factors, $\theta_{f-1}^f$, between fidelities chosen as ratios of the QC compute time of subsequent fidelities. Single-fidelity KRR learning curves at TZVP are also provided for reference. The legend describes the baseline fidelity, $f_b$, of the multifidelity model.
  • Figure 4: MFML and o-MFML learning curves for scaling factors, $\theta_f^F$, between fidelities selected as ratios of the QC compute time of that fidelity to the compute time of TZVP, the target fidelity. Single-fidelity KRR learning curves are also provided for reference. The legend describes the baseline fidelity, $f_b$, of the multifidelity model.
  • Figure 5: Comparison of learning curves for fixed scaling factors $\gamma$, $\theta_{f-1}^f$, and $\theta_f^F$ with $f_b$: STO3G. The x-axis reports the number of training samples used at the highest fidelity, that is, TZVP. Both MFML and o-MFML models are compared. Increasing values of $\gamma$ shift the learning curves downward by a constant offset. The cost-informed scaling factors yield a higher MAE.
  • ...and 5 more figures