
Optimal design of experiments in the context of machine-learning inter-atomic potentials: improving the efficiency and transferability of kernel based methods

Bartosz Barzdajn, Christopher P. Race

TL;DR

The paper tackles the problem of constructing data-efficient, transferable kernel-based interatomic potentials by applying optimal design of experiments to select informative training configurations offline. It formulates a priori designs within the GAP/GPR framework, using a max-min kernel-distance criterion to assemble diverse training sets without requiring labelled data upfront. Empirical results on elastically deformed Zr configurations show that optimised training sets improve off-sample performance and transferability compared with representative samples, while providing objective quality measures for dataset design. This approach offers a lightweight alternative to heavy active-learning pipelines and can accelerate the development of robust, generalisable potentials for materials simulations.

Abstract

Data-driven, machine learning (ML) models of atomistic interactions are often based on flexible and non-physical functions that can translate nuanced aspects of atomic arrangements into predictions of energies and forces. As a result, these potentials are only as good as the training data (usually results of so-called ab initio simulations), and we need to make sure that we have enough information for a model to become sufficiently accurate, reliable and transferable. The main challenge stems from the fact that descriptors of chemical environments are often sparse high-dimensional objects without a well-defined continuous metric. Therefore, it is rather unlikely that any ad hoc method of choosing training examples will be indiscriminate, and it will be easy to fall into the trap of confirmation bias, where the same narrow and biased sampling is used to generate both training and test sets. We will demonstrate that classical concepts of statistical planning of experiments and optimal design can help to mitigate such problems at a relatively low computational cost. The key feature of the methods we will investigate is that they allow us to assess the informativeness of data (how much we can improve the model by adding or swapping a training example) and to verify whether training is feasible with the current set before obtaining any reference energies and forces (a so-called off-line approach). In other words, we are focusing on an approach that is easy to implement and doesn't require sophisticated frameworks that involve automated access to high-performance computing (HPC) resources.

Paper Structure

This paper contains 8 sections, 5 equations, 7 figures, and 2 algorithms.

Figures (7)

  • Figure 1: Illustration of the problem of double mapping using the example of the EAM model. Here we focus only on the contribution of a specific atom to the total energy and consider only the embedding representing the many-body interactions. The density $\bar{\rho}$ in principle refers to the local electronic density and consists of contributions $\rho\left(r_{i}\right)$ from neighbouring atoms, where $\rho$ is a non-linear function that can also depend on adjustable parameters, while $r_{i}$ represents the distance to a specific neighbour. We also assume that considering the first N neighbours provides almost complete information. The quantity $\bar{\rho}$ can be considered as a 1-dimensional descriptor of the local environment. The contribution to the energy will be a non-linear function of this quantity. However, in this example we assume that it can be expressed in a linear basis to illustrate the model complexity represented by the number of features M. An example of a feature can be $\bar{\rho}$ raised to the k-th power in the polynomial representation of the mapping.
  • Figure 2: Illustration of the Golchi and Loeppky algorithm [Golchi_Monte__2016] for max-min designs, which maximises the minimum distance, using the example of nine points on a square grid. The state in each iteration is defined by the design vector $S$ and the vector $\Psi$. The vector $\Psi$ can be considered as a 'decision' function, while $X$ represents the pool of candidates. Initially, $\Psi$ consists of the distances between the first element and the remaining elements. In the following iterations, $\Psi$ is updated element-wise according to $\Psi_{i}^{(p)}=\min\left(\Psi_{i}^{(p-1)},\delta_{i}^{(p)}\right)$, where $p$ is the iteration index, $\delta_{i}^{(p)}$ is the distance between the $p$-th and $i$-th candidates and $\delta\left(\cdot\right)$ is the distance function. In each iteration we find the maximum value of the updated $\Psi$. The position of this element indicates the new optimal design point.
  • Figure 3: Application of the Golchi and Loeppky algorithm [Golchi_Monte__2016] for max-min designs. In this example, we are optimising the training set for Gaussian process regression. The distance measure follows the definition \ref{eq:k_dist} and is based on a Gaussian kernel with unit scale parameter. The training points are selected from a pool of candidates generated using strongly biased sampling. This pool consists of $10^{4}$ samples from the normal distribution $\mathcal{N}\left(0,0.5\right)$. On the plots they are represented by small, semi-transparent grey points. Such an example simulates the conditions under which we have to select atomistic configurations for the training of ML potentials. The first plot on the left illustrates a representative design, which is a random sub-sample of 100 candidates. This plot also illustrates how biased the underlying sampling is. Next, in the middle, we show the distribution in the optimised training set, also consisting of 100 examples. The index indicates the order of addition of the first 9 optimal points, revealing the algorithm's strategy of filling empty spaces after enclosing a domain. Finally, the plot on the right shows how the entropy changes when we replace the random/representative examples with the optimised ones. The entropy for representative sets corresponds to different realisations of this design. Here, the entropy of the resulting kernel matrix $K$ is proportional to $\log\left(\det K\right)$ [Shewry_Maximum_JoAS_1987]. Note that the aim is not to provide a uniform importance sampling, but optimal training points for a given regression method. In this case, however, these objectives coincide so that we can visually assess the quality of the solution. Other models may have a different solution. See example a) in figure \ref{fig:design_example}. Finally, these results can be easily recreated using algorithm \ref{alg:opt_grid} from appendix \ref{sec:maxmin_python}. We only need to replace the
  • Figure 4: Comparison of pairwise distances within a random and an optimised training set. Figure a) shows normalised probability histograms of the kernel distance. Figure b) shows the distribution within a training set of the $L_{2}$ norm of the deformation tensor, which we use to quantify its magnitude. The lack of a strong difference between the locations of these distributions shows that the reduction in overlap has not been achieved by simply increasing the amount of deformation.
  • Figure 5: Examples of performance when predicting energy per atom on training and validation sets. Brightness indicates prediction error from fig. \ref{fig:cross_val}.
  • ...and 2 more figures
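
The greedy max-min update described in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' algorithm from the appendix. The function names (`maxmin_design`, `kernel_distance`, `euclidean_distance`, `log_det_kernel`) are hypothetical. The kernel distance is assumed to be the usual kernel-induced form $\sqrt{k(x,x)+k(y,y)-2k(x,y)}$, because the paper's own definition is not reproduced in this summary. The pool is assumed to be two-dimensional with $0.5$ read as a standard deviation, and it is smaller than the $10^{4}$ samples used in Figure 3. The small `jitter` added to the kernel matrix is also an assumption, included only for numerical stability.

```python
import numpy as np


def euclidean_distance(x, X):
    """Euclidean distance between one point x and every row of X."""
    return np.linalg.norm(X - x, axis=1)


def kernel_distance(x, X, scale=1.0):
    """Distance induced by a Gaussian kernel k (assumed form):
    sqrt(k(x,x) + k(y,y) - 2 k(x,y)) = sqrt(2 - 2 k(x,y))."""
    sq = np.sum((X - x) ** 2, axis=1)
    k = np.exp(-0.5 * sq / scale**2)
    return np.sqrt(np.maximum(2.0 - 2.0 * k, 0.0))


def maxmin_design(X, n_points, distance, start=0):
    """Greedy max-min selection of n_points rows of the candidate pool X.

    psi[i] holds the distance from candidate i to its nearest design point;
    the candidate with the largest psi is added in each iteration.
    """
    S = [start]                      # design vector (indices into X)
    psi = distance(X[start], X)      # distances to the first element
    while len(S) < n_points:
        new = int(np.argmax(psi))    # position of the maximum of psi
        S.append(new)
        # element-wise update: psi_i = min(psi_i, delta_i)
        psi = np.minimum(psi, distance(X[new], X))
    return S


def log_det_kernel(X, scale=1.0, jitter=1e-8):
    """log(det K) for a Gaussian kernel matrix; jitter is a stabiliser (assumption)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * sq / scale**2) + jitter * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)
    return logdet


# Nine points on a square grid, as in Figure 2: starting from one corner,
# the remaining corners are selected before the centre.
grid = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
print(maxmin_design(grid, 5, euclidean_distance))

# Strongly biased pool, loosely following Figure 3 (smaller, for speed).
rng = np.random.default_rng(0)
pool = rng.normal(0.0, 0.5, size=(2000, 2))
optimised = maxmin_design(pool, 50, kernel_distance)
representative = rng.choice(len(pool), size=50, replace=False)
print(log_det_kernel(pool[optimised]), log_det_kernel(pool[representative]))
```

Keeping the running minimum in `psi` means each iteration needs only one pass over the pool, so the full pairwise distance matrix of the candidates is never formed. Comparing the two printed log-determinants gives a quick check, without any reference energies or forces, of whether the optimised set is more diverse than a representative sub-sample.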