Multi-Fidelity Machine Learning for Excited State Energies of Molecules

Vivin Vinod, Sayan Maity, Peter Zaspel, Ulrich Kleinekathöfer

TL;DR

The paper tackles the high cost of obtaining accurate excited-state energies by introducing a multi-fidelity machine learning framework based on kernel ridge regression. By fusing a small set of high-fidelity TD-DFT data with larger sets from cheaper fidelities, the approach preserves high-accuracy predictions while dramatically reducing offline data generation, as demonstrated on benzene, naphthalene, and anthracene along MD and DFTB trajectories. MFML achieves predictive accuracy comparable to single-fidelity high-cost models, with data-generation time reduced by over a factor of $30$, and substantial gains expected for larger systems and more demanding electronic-structure methods. These results show that hierarchical fidelity data can be exploited to enable scalable, trajectory-aware excited-state energetics for complex molecular assemblies and photophysical processes.

Abstract

The accurate but fast calculation of molecular excited states is still a very challenging topic. For many applications, detailed knowledge of the energy funnel in larger molecular aggregates is of key importance requiring highly accurate excited state energies. To this end, machine learning techniques can be an extremely useful tool though the cost of generating highly accurate training datasets still remains a severe challenge. To overcome this hurdle, this work proposes the use of multi-fidelity machine learning where very little training data from high accuracies is combined with cheaper and less accurate data to achieve the accuracy of the costlier level. In the present study, the approach is employed to predict the first excited state energies for three molecules of increasing size, namely, benzene, naphthalene, and anthracene. The energies are trained and tested for conformations stemming from classical molecular dynamics simulations and from real-time density functional tight-binding calculations. It can be shown that the multi-fidelity machine learning model can achieve the same accuracy as a machine learning model built only on high cost training data while having a much lower computational effort to generate the data. The numerical gain observed in these benchmark test calculations was over a factor of 30 but certainly can be much higher for high accuracy data.
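The idea of combining a few expensive samples with many cheap ones can be illustrated with a minimal two-fidelity sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the one-dimensional toy functions `e_high` and `e_low` stand in for excited state energies at two fidelities, the Gaussian kernel width and regularization are arbitrary, and the difference-based combination (cheap model on many points plus a correction learned on the few shared points) is one common way to fuse two fidelities rather than the authors' exact multi-fidelity scheme.

```python
import math

def gaussian_kernel(a, b, length=0.1):
    """Gaussian (RBF) kernel between two scalar inputs (width is an assumption)."""
    return math.exp(-((a - b) ** 2) / (2.0 * length ** 2))

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_krr(xs, ys, reg=1e-8):
    """Kernel ridge regression on (xs, ys); returns a predictor f(x)."""
    gram = [[gaussian_kernel(a, b) + (reg if i == j else 0.0)
             for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    alphas = solve(gram, ys)
    return lambda x: sum(w * gaussian_kernel(x, xi) for w, xi in zip(alphas, xs))

# Hypothetical toy fidelities: the cheap one captures the oscillation,
# the expensive one adds a smooth correction on top of it.
def e_high(x):
    return math.sin(20.0 * x) + 0.2 * x

def e_low(x):
    return math.sin(20.0 * x)

x_low = [i / 32 for i in range(33)]   # many cheap samples
x_high = x_low[::8]                   # few expensive samples, nested in x_low

# Single-fidelity baseline: only the few expensive samples.
single = fit_krr(x_high, [e_high(x) for x in x_high])

# Two-fidelity model: cheap model on all points, corrected by the
# high-minus-low difference learned on the shared points.
low_many = fit_krr(x_low, [e_low(x) for x in x_low])
low_few = fit_krr(x_high, [e_low(x) for x in x_high])

def mfml(x):
    return low_many(x) + single(x) - low_few(x)

x_test = [(i + 0.5) / 200 for i in range(200)]

def mae(predict):
    """Mean absolute error against the expensive reference on the test grid."""
    return sum(abs(predict(x) - e_high(x)) for x in x_test) / len(x_test)

mae_single = mae(single)
mae_mfml = mae(mfml)
print(f"single-fidelity MAE: {mae_single:.4f}")
print(f"two-fidelity MAE:    {mae_mfml:.4f}")
```

In this toy setting both models use the same five expensive samples, so any gain of the two-fidelity model comes from the cheap data alone. The effect relies on the difference between the fidelities being smoother than the target itself, which mirrors the accuracy hierarchy the approach requires.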

Paper Structure

This paper contains 13 sections, 12 equations, and 5 figures.

Figures (5)

  • Figure 1: Multi-fidelity machine learning (MFML) based on kernel ridge regression significantly reduces the cost of training a machine learning model for the prediction of quantum chemistry properties, here, excited state energies. In contrast to the conventional single-fidelity machine learning method, the discussed method uses data from multiple fidelities with a few highly accurate (and costly) data samples and a growing number of less accurate (hence usually cheaper) data samples, thereby reducing the overall computational cost for the generation of the training data. This procedure helps to expedite the machine learning pipeline for predicting (first) excited state energies.
  • Figure 2: A) Energy distributions of the different fidelities (basis sets) in the training sets based on the MD trajectories of benzene, naphthalene, and anthracene. To aid visibility, the STO-3G distributions (yellow) have been shifted down in energy by 0.5 eV for all molecules. The complete training data for each fidelity is shown as a density plot obtained using kernel density estimation. B) Scatter plots comparing the excitation energies using the TZVP basis set to the excitation energies at the other fidelities (basis sets) for the conformations in the training data. Again, the STO-3G distributions (yellow) have been shifted down in energy by 0.5 eV. C) Energy differences (including standard deviations) between the different fidelities and the target fidelity (TZVP) for the conformations in the training data. A hierarchy in the accuracy of the different excited state calculations is a necessary condition for the MFML approach to work. D) Learning curves for the single-fidelity KRR model presented on a double-logarithmic scale.
  • Figure 3: The effectiveness of the MFML method is shown through learning curves, while the results for the evaluation set are also analyzed in the time and energy domains. A) Multi-fidelity learning curves based on the excited state energies along the MD trajectories for benzene, naphthalene, and anthracene. With the addition of lower fidelities, the prediction error decreases, as can be seen in the difference between the standard KRR model (blue) and the MFML model using data from all five fidelities (yellow). B) Energy distributions based on the holdout sets using the TZVP reference calculations (red) and the predictions from the MFML model $P_{\rm MFML}^{(\rm TZVP;STO-3G)}$ for $N_{\rm train}^{\rm TZVP}=512$ (black). In all cases, the predictions from the MFML model match the reference energy distributions accurately. C) The corresponding time autocorrelation functions (ACFs) of the excited state energies. The red lines correspond to the ACFs of the TZVP reference calculations from the holdout set, while the black lines report the ACFs of the excited state energies predicted by the MFML model for the conformations belonging to this set.
  • Figure 4: Computation times to generate the training data sets versus the MAE of the MFML models, verifying the computational benefits of the MFML models. A) Results for the MD trajectories: With each additional numerically cheaper fidelity, the time needed to generate the training data decreases for a given MAE, i.e., prediction accuracy, in the cases with a clear hierarchy in the fidelities. B) Findings for the DFTB trajectories: The time benefits are clearly visible for the various molecules across the fidelities. For anthracene, the yellow line corresponding to the MFML model built on STO-3G does not provide any time improvement for the aforementioned reasons.
  • Figure 5: Computational time to generate the MFML training set versus the MAE for benzene. The results for the MD-based trajectory are presented on the left-hand side, while the right-hand side shows the results for the DFTB-based trajectory. The target fidelity is set to QZVP. Additionally, two semi-empirical methods, ZINDO and LC-DFTB, were employed. For each numerically cheaper fidelity that is added to the model, clear offsets of the learning curves can be observed.