LFaB: Low fidelity as Bias for Active Learning in the chemical configuration space

Vivin Vinod, Peter Zaspel

TL;DR

The paper tackles the inefficiency of variance-driven active learning in quantum-chemical surrogate modeling by introducing Low-Fidelity-as-Bias (LFaB), a bias-based sampling strategy that uses low-fidelity labels to approximate high-fidelity bias. LFaB selects samples with the largest predicted bias, achieving substantial reductions in required high-fidelity evaluations across QM7b atomization energies, VIB5 ab initio PES, and QeMFi excitation energies. In benchmarks, LFaB outperforms standard variance-based AL and often matches the greedy-optimal selection, reducing training data needs by up to an order of magnitude and enabling cost-effective, high-accuracy quantum-chemical models. The approach is simple to implement and leverages existing multifidelity concepts, offering a practical tool for efficient computational chemistry workflows.

Abstract

Active learning promises to provide an optimal training sample selection procedure in the construction of machine learning models. It often relies on minimizing the model's variance, which is assumed to decrease the prediction error. Still, it is frequently even less efficient than pure random sampling. Motivated by the bias-variance decomposition, we propose to minimize the model's bias instead of its variance. By doing so, we are able to almost exactly match the best-case error over all possible greedy sample selection procedures for a relevant application. Our bias approximation is based on using cheap-to-calculate low-fidelity data, as known from $Δ$-ML or multifidelity machine learning. We exemplify our approach for a wider class of applications in quantum chemistry, including predicting excitation energies and ab initio potential energy surfaces. Here, the proposed method reduces training data consumption by up to an order of magnitude compared to standard active learning.
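
For reference, the bias-variance decomposition that motivates the approach can be written out as follows. This is the standard textbook form for a squared-error loss, not an equation copied from the paper; the symbols $f$, $\hat f$, $\varepsilon$, and $\sigma^2$ are notation assumed here, with $y = f(x) + \varepsilon$, $\mathbb{E}[\varepsilon] = 0$, and $\mathrm{Var}(\varepsilon) = \sigma^2$.

$$
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}
+ \sigma^2
$$

Variance-based active learning targets the second term, whereas the abstract proposes targeting the first. One plausible reading of "low fidelity as bias", stated here as an assumption rather than as the paper's exact definition, is to pick the next sample from the unlabeled pool $\mathcal{P}$ where a high-fidelity model $\hat f_{\mathrm{high}}$ deviates most from the cheap low-fidelity label $f_{\mathrm{low}}$:

$$
x^{\ast} = \arg\max_{x \in \mathcal{P}} \big| f_{\mathrm{low}}(x) - \hat f_{\mathrm{high}}(x) \big|
$$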

Paper Structure

This paper contains 12 sections, 12 equations, 10 figures, 1 algorithm.

Figures (10)

  • Figure 1: A pictorial representation of the workflow used to carry out numerical benchmarks of the low fidelity as bias (LFaB) method. The benchmarks are performed for three datasets, QM7b [montavon2013machine], VIB5 [zhang_vib5_2022], and QeMFi [vinod2024QeMFi_paper], for atomization energies, ab initio potential energy surfaces, and excitation energies, respectively. An initial set of molecular configurations is chosen, its QC properties (labels) are computed, and a GPR model is trained on it. The uncertainty of the trained model, either the variance or the bias, is then estimated for the unlabeled molecular configurations in the training data pool. As seen at the lower right corner, LFaB uses the lower fidelity as a measure of the bias of the model trained at the higher fidelity. In this work, the LFaB scheme is benchmarked against alternative active learning sampling techniques based on GPR model variance and on the variance of model ensembles, and against a greedy-optimal selection.
  • Figure 2: Learning curves for the prediction of atomization energies of the QM7b dataset with different sampling techniques for active learning. For the LFaB scheme, the low fidelities are expressed by the choice of the QC method, with each panel showing a different basis set choice. Thus, the high fidelity is CCSD(T), with the low fidelity being either HF or MP2. Reference dashed lines in the right-hand side plot establish that the LFaB sampling scheme reduces the number of active learning iterations required to achieve a certain error in comparison to the random selection of training samples. The LFaB method is seen to be superior to the variance-based sampling techniques.
  • Figure 3: Learning curves showing prediction error (measured as MAE) versus number of training samples for the prediction of ab initio PES for $\rm CH_3Cl$ and $\rm CH_3F$ from the VIB5 database. Each learning curve corresponds to a different active learning sampling scheme. The LFaB method outperforms random sampling of training data in terms of prediction error. Furthermore, it results in a prediction error that is nearly identical to that of the greedy-optimal sampling approach, outperforming both GPR variance and model ensemble variance.
  • Figure 4: Learning curves for the prediction of excitation energies of diverse molecules from the QeMFi dataset. Each data point is added after a cycle of active learning with the corresponding sampling scheme. The terms in parentheses for the LFaB measure indicate the lower fidelity used, in this case corresponding to the basis set chosen for the quantum chemistry calculation. The random sampling approach is also shown for contrast, being the common approach in machine learning workflows for quantum chemistry. In every case, the LFaB method performs better than the random sampling approach and the other active learning sampling techniques.
  • Figure 5: PCA scatter plot for selected molecules studied in this work, indicating points selected by the greedy-optimal and LFaB methods in the first 500 iterations. Points selected by both methods are indicated separately. The axes are scaled to unit values for the first two principal components (PC-1 and PC-2) of the CM molecular descriptor. The LFaB method selects almost all the same data points as the greedy-optimal method. This indicates that the novel LFaB method is as good as the greedy-optimal selection of training data, while using a lower (and thereby cheaper) fidelity as reference data.
  • ...and 5 more figures
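
The workflow described in the Figure 1 caption (train on an initial set, score the unlabeled pool, add the highest-scoring configuration, repeat) can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the data are a synthetic 1D toy problem rather than molecular descriptors, the GPR is reduced to its posterior mean with a fixed RBF kernel, and the bias score `|y_low - prediction|` is one plausible reading of "low fidelity as bias". All function names, parameters, and the toy functions are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between two 1D arrays of inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)


def gpr_mean(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean of a zero-mean GP with fixed hyperparameters."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    alpha = np.linalg.solve(K, y_train)
    return rbf_kernel(x_query, x_train) @ alpha


def lfab_select(x_pool, y_low, high_fidelity, n_init=5, n_iter=20, seed=0):
    """Grow a training set by repeatedly labeling the pool point whose
    low-fidelity label deviates most from the current model prediction
    (assumed bias proxy)."""
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(x_pool), size=n_init, replace=False)]
    for _ in range(n_iter):
        # Expensive step in practice: high-fidelity labels for the training set.
        y_train = high_fidelity(x_pool[labeled])
        pred = gpr_mean(x_pool[labeled], y_train, x_pool)
        score = np.abs(y_low - pred)   # assumed bias estimate over the pool
        score[labeled] = -np.inf       # never re-select a labeled point
        labeled.append(int(np.argmax(score)))
    return labeled


# Toy fidelities (hypothetical): the low fidelity is a cheap, slightly
# distorted version of the high-fidelity target.
def f_high(x):
    return np.sin(3.0 * x) + 0.5 * x


def f_low(x):
    return 0.9 * np.sin(3.0 * x) + 0.5 * x + 0.05


x_pool = np.linspace(0.0, 3.0, 200)
y_low = f_low(x_pool)  # low-fidelity labels are assumed available for the whole pool

chosen = lfab_select(x_pool, y_low, f_high)
final_pred = gpr_mean(x_pool[chosen], f_high(x_pool[chosen]), x_pool)
mae = float(np.mean(np.abs(final_pred - f_high(x_pool))))
print(f"training set size: {len(chosen)}, pool MAE: {mae:.4f}")
```

The key design point is that the score needs only low-fidelity labels on the pool, so high-fidelity calculations are spent solely on the configurations that get selected. Swapping the `score` line for a GPR posterior variance would give the variance-based baseline that the figures compare against.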