Understanding multi-fidelity training of machine-learned force-fields

John L. A. Gardner, Hannes Schulz, Jean Helie, Lixin Sun, Gregor N. C. Simm

Abstract

This study systematically investigates two multi-fidelity strategies used to train machine-learned force fields (MLFFs) -- pre-training/fine-tuning and multi-headed training -- and elucidates the mechanisms underpinning their success. For pre-training and fine-tuning, we uncover a log-log linear relationship between pre-trained and fine-tuned accuracies that holds across model architectures, model sizes, and quantum-chemical methods. The success of this approach hinges on the quantity and quality of available pre-training data, and, critically, the inclusion of force labels. We demonstrate that pre-trained representations are inherently method-specific, requiring adaptation of the model backbone during fine-tuning. In contrast, multi-headed models learn method-independent backbone representations, where again the heads' accuracies are log-log linearly related. Relative to pre-training and fine-tuning, these shared representations marginally reduce model performance in most cases. However, this trade-off is offset by practical advantages: multi-headed training extends naturally to multiple labelling methods and enables partial replacement of expensive labels with cheaper alternatives, paving the way towards cost-efficient universal MLFFs.
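The log-log linear relationship described in the abstract can be illustrated with a short fit. The following is a minimal sketch, not the authors' analysis code: the function names `fit_loglog` and `predict_finetune_mae` are hypothetical, and the error values are synthetic numbers chosen only to demonstrate the procedure. It assumes the relationship takes the form log(fine-tuned MAE) = slope × log(pre-trained MAE) + intercept.

```python
import numpy as np


def fit_loglog(pretrain_mae, finetune_mae):
    """Least-squares fit of log(finetune) = slope * log(pretrain) + intercept."""
    x = np.log(np.asarray(pretrain_mae, dtype=float))
    y = np.log(np.asarray(finetune_mae, dtype=float))
    # polyfit returns coefficients from highest degree to lowest
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


def predict_finetune_mae(pretrain_mae, slope, intercept):
    """Invert the log transform: finetune = exp(intercept) * pretrain ** slope."""
    return np.exp(intercept) * np.asarray(pretrain_mae, dtype=float) ** slope


# Synthetic, illustrative values only (NOT results from the paper):
# an exact power law with exponent 0.6 and prefactor 1.5.
pretrain = np.array([8.0, 4.0, 2.0, 1.0])
finetune = 1.5 * pretrain ** 0.6

slope, intercept = fit_loglog(pretrain, finetune)
predicted = predict_finetune_mae(0.5, slope, intercept)
```

A straight line in log-log space corresponds to a power law in the original units, so once the fit is available, the pre-trained validation error can serve as a rough predictor of the error after fine-tuning, under the assumption that the trend continues to hold.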

Paper Structure

This paper contains 30 sections, 6 equations, 18 figures, and 4 tables.

Figures (18)

  • Figure 1: Comparison of two multi-fidelity training strategies for MLFFs considered in this work. (a) A sequential pre-training and fine-tuning approach. The model is first trained on a large dataset of low-fidelity labels (blue, e.g., DFT) and subsequently fine-tuned on a smaller dataset of high-fidelity labels (black, e.g., CC). (b) A multi-headed approach where a shared model backbone learns from multiple label fidelities simultaneously, each through a dedicated readout head.
  • Figure 2: Overview of the dataset partitioning and labelling scheme. We partition the dataset into four non-overlapping subsets termed a, b, c, and t. There are CC, DFT, and xTB labels for each structure in each subset. Unless otherwise stated, we test all models on the CC-labelled hold-out test set t. Subsets a, c, and t contain ≈ 100k structures each; the remaining ≈ 175k structures were assigned to subset b.
  • Figure 3: Effect of quantity of pre-training data on accuracy of fine-tuned model. (a) Mean absolute error (MAE) in CC energy predictions versus the number of CC-labelled fine-tuning structures. Coloured lines connect points corresponding to the same amount of pre-training data: black for no pre-training (i.e., direct CC training), and increasingly dark shades of blue for increasing amounts of DFT-labelled structures used during pre-training. (b) Relationship between the energy MAE of the pre-trained model on the DFT-labelled validation set and the final fine-tuned model's energy MAE on the CC-labelled test set. Each point corresponds to a single run, with larger markers and darker colours indicating more pre-training data. Separate log-log linear best fits are shown for models fine-tuned on 1k (squares) and 10k (triangles) CC data.
  • Figure 4: Effect of pre-training labels on accuracy after fine-tuning. (a) CC test set MAE versus number of CC-labelled fine-tuning structures. Line colour indicates the amount and source of pre-training data: black for no pre-training; increasingly dark shades of blue and red for more DFT- and xTB-labelled structures from split b, respectively. (b) Distribution of energy differences between DFT/xTB and CC labels after accounting for energy offsets $\mu^{\mathcal{M}}$. (c) Pre-trained model error on the relevant DFT/xTB validation set versus fine-tuned model error on the CC test set for models of varying size, pre-trained on various amounts of (DFT/xTB; b) and fine-tuned on 10k (CC; a). Each point is a single run, with larger markers corresponding to more pre-training data. Log-log lines of best fit are shown for each model size and labelling method. (d) CC test MAE versus additional cost of generating pre-training labels. The cost is relative to that of the 10k fine-tuning labels, assuming xTB:CC and DFT:CC cost ratios of 1:1000 and 1:10, respectively.
  • Figure 5: Effect of model architecture on the benefits of pre-training. (a) CC test MAE for MACE and Allegro models pre-trained on (DFT; b) and fine-tuned on (CC; a), varying the amounts of pre-training and fine-tuning data. Black lines show direct training on (CC; a) for reference. (b) Pre-trained error on the DFT validation set versus fine-tuned error on the CC test set for both architectures, each fine-tuned on either 10k or 100k CC structures. Each point corresponds to an independent training run, with larger markers corresponding to more pre-training data. Log-log lines of best fit are shown for each architecture and amount of fine-tuning data.
  • ...and 13 more figures
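
The relative label cost used in Figure 4(d) follows directly from the cost ratios stated in that caption (xTB:CC of 1:1000 and DFT:CC of 1:10, measured against the 10k CC fine-tuning labels). The sketch below works through that arithmetic; the function name `relative_pretraining_cost` and the example of 100k pre-training structures are illustrative assumptions, not values taken from the paper.

```python
# Cost of one label relative to one CC label, from the ratios quoted in the
# Figure 4(d) caption: DFT:CC = 1:10 and xTB:CC = 1:1000.
COST_PER_LABEL = {"CC": 1.0, "DFT": 1.0 / 10.0, "xTB": 1.0 / 1000.0}


def relative_pretraining_cost(n_pretrain, method, n_finetune_cc=10_000):
    """Cost of the pre-training labels divided by the cost of the CC fine-tuning labels."""
    pretrain_cost = n_pretrain * COST_PER_LABEL[method]
    finetune_cost = n_finetune_cc * COST_PER_LABEL["CC"]
    return pretrain_cost / finetune_cost


# Hypothetical example: 100k pre-training structures labelled with each cheap method.
dft_cost = relative_pretraining_cost(100_000, "DFT")  # about 1.0: as costly as the CC labels
xtb_cost = relative_pretraining_cost(100_000, "xTB")  # about 0.01: 1% of the CC label cost
```

Under these assumed ratios, 100k DFT labels cost about as much as the 10k CC fine-tuning labels themselves, whereas the same number of xTB labels adds roughly 1% to the labelling budget.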