LaMM: Semi-Supervised Pre-Training of Large-Scale Materials Models

Yosuke Oyama, Yusuke Majima, Eiji Ohta, Yasufumi Sakai

TL;DR

LaMM tackles two costs of pre-training neural-network potentials, expensive DFT labeling and load imbalance, by introducing a semi-supervised pre-training framework on a joint dataset of $\sim 3\times 10^8$ samples. It builds two models, LaMM-S and LaMM-L, on the PaiNN and EquiformerV2 backbones, and adds a universal loss for multi-dataset training, a denoising labeling scheme, and a load-balancing strategy for scalable multi-node training. Empirically, LaMM-S and LaMM-L fine-tune faster and more accurately on unseen datasets, with throughput gains of $2.44\times$ to $3.38\times$ and better energy and force predictions on HME21 than the baselines. This work reduces DFT labeling costs and supports the deployment of universal foundation models for materials discovery and simulation, with potential applicability to diverse downstream tasks and datasets.

Abstract

Neural network potentials (NNPs) are crucial for accelerating computational materials science by surrogating density functional theory (DFT) calculations. Improving their accuracy is possible through pre-training and fine-tuning, where an NNP model is first pre-trained on a large-scale dataset and then fine-tuned on a smaller target dataset. However, this approach is computationally expensive, mainly due to the cost of DFT-based dataset labeling and load imbalances during large-scale pre-training. To address this, we propose LaMM, a semi-supervised pre-training method incorporating improved denoising self-supervised learning and a load-balancing algorithm for efficient multi-node training. We demonstrate that our approach effectively leverages a large-scale dataset of $\sim$300 million semi-labeled samples to train a single NNP model, resulting in improved fine-tuning performance in terms of both speed and accuracy.

Paper Structure

This paper contains 16 sections, 2 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Overview of LaMM.
  • Figure 2: Overview of our pre-training dataset. "E": Energy, "F": Forces, "N": Pseudo-force labels. "S2EF-T": S2EF-Total. "IS2RE-T": IS2RE-Total. "TS": Temperature sampling with $T=2$. *: Data samples with more than 300 atoms are excluded from ODAC23, as explained in the dataset subsection.
  • Figure 3: Heatmap of the number of datasets containing each element. Markers represent which subsets contain the corresponding element.
  • Figure 4: Existing and proposed denoising labeling algorithms. (a) Input coordinate $x_i + \Delta x_i$ and labels $-\Delta x_i$ of the existing method. (b) Input coordinate $x_i + \Delta x_i - \overline{\Delta x}$ and labels $-(\Delta x_i - \overline{\Delta x})$ of the proposed method.
  • Figure 5: Histogram of the per-sample number of atoms in our pre-training dataset. Each histogram is smoothed by kernel density estimation with Gaussian kernels, using $10^5$ randomly sampled structures. Each point marks the location where the corresponding histogram is maximized.
  • ...and 5 more figures
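
The two labeling schemes in the Figure 4 caption can be sketched in a few lines. This is only an illustration of the formulas in that caption, not the authors' implementation: the function name `make_denoising_sample`, the Gaussian noise distribution, and the noise scale `sigma` are assumptions introduced here, since this page does not state how $\Delta x_i$ is drawn.

```python
import random


def make_denoising_sample(coords, sigma=0.1, center_noise=True, rng=None):
    """Return (noisy_coords, labels) for one structure.

    coords: sequence of (x, y, z) positions, one per atom.
    center_noise=False follows panel (a): input x_i + dx_i, label -dx_i.
    center_noise=True follows panel (b): the mean displacement is removed,
    so the input is x_i + dx_i - mean(dx) and the label is -(dx_i - mean(dx)).
    Gaussian noise with scale sigma is an assumption of this sketch.
    """
    rng = rng or random.Random()
    # Draw an independent displacement for every atom and every axis.
    noise = [[rng.gauss(0.0, sigma) for _ in range(3)] for _ in coords]
    if center_noise and noise:
        n = len(noise)
        # Per-axis mean displacement over all atoms in the structure.
        mean = [sum(d[k] for d in noise) / n for k in range(3)]
        noise = [[d[k] - mean[k] for k in range(3)] for d in noise]
    # Perturbed input coordinates and the per-atom labels that undo the noise.
    noisy = [[x[k] + d[k] for k in range(3)] for x, d in zip(coords, noise)]
    labels = [[-d[k] for k in range(3)] for d in noise]
    return noisy, labels


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    noisy, labels = make_denoising_sample(atoms, rng=random.Random(0))
    for position, label in zip(noisy, labels):
        print(position, label)
```

In both variants, adding a label to its noisy coordinate recovers the original position. With `center_noise=True` the labels also sum to zero over the atoms of a structure, so the perturbation carries no net translation.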