LaMM: Semi-Supervised Pre-Training of Large-Scale Materials Models
Yosuke Oyama, Yusuke Majima, Eiji Ohta, Yasufumi Sakai
TL;DR
LaMM addresses two costs of pre-training neural network potentials: expensive DFT labeling and load imbalance during large-scale training. It introduces a semi-supervised pre-training framework on a joint dataset of $\sim 3\times 10^8$ samples. The framework builds two models, LaMM-S and LaMM-L, on the PaiNN and EquiformerV2 backbones, and combines a universal loss for multi-dataset training, a denoising labeling scheme, and a load-balancing strategy for scalable multi-node training. Empirically, LaMM-S and LaMM-L achieve faster fine-tuning and improved accuracy on unseen datasets, with throughput gains of $2.44\times$ to $3.38\times$ and better energy and force predictions on HME21 than baselines. This work reduces DFT labeling costs and supports the deployment of universal foundation models for materials discovery and simulation, with potential applicability to diverse downstream tasks and datasets.
Abstract
Neural network potentials (NNPs) are crucial for accelerating computational materials science by surrogating density functional theory (DFT) calculations. Improving their accuracy is possible through pre-training and fine-tuning, where an NNP model is first pre-trained on a large-scale dataset and then fine-tuned on a smaller target dataset. However, this approach is computationally expensive, mainly due to the cost of DFT-based dataset labeling and load imbalances during large-scale pre-training. To address this, we propose LaMM, a semi-supervised pre-training method incorporating improved denoising self-supervised learning and a load-balancing algorithm for efficient multi-node training. We demonstrate that our approach effectively leverages a large-scale dataset of $\sim$300 million semi-labeled samples to train a single NNP model, resulting in improved fine-tuning performance in terms of both speed and accuracy.
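To illustrate why load imbalance arises and how it can be reduced, the following is a minimal sketch, not the load-balancing algorithm proposed in this work. It assumes that the per-sample cost is roughly proportional to the number of atoms, and it uses a generic greedy heuristic (largest sample first, assigned to the currently lightest worker). The function names `balance_by_atom_count` and `worker_loads` and the example atom counts are hypothetical.

```python
import heapq


def balance_by_atom_count(atom_counts, num_workers):
    """Greedy sketch: place each sample, largest first, on the lightest worker.

    Assumption: per-sample cost is proportional to its atom count.
    Returns one list of sample indices per worker.
    """
    # Min-heap of (current load, worker id); the lightest worker is popped first.
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]

    # Visit samples in order of decreasing atom count.
    order = sorted(range(len(atom_counts)), key=lambda i: atom_counts[i], reverse=True)
    for idx in order:
        load, w = heapq.heappop(heap)
        assignment[w].append(idx)
        heapq.heappush(heap, (load + atom_counts[idx], w))
    return assignment


def worker_loads(atom_counts, assignment):
    """Total atom count assigned to each worker."""
    return [sum(atom_counts[i] for i in part) for part in assignment]


# Hypothetical mini-batch: one large structure and several small ones.
atom_counts = [200, 10, 10, 10, 90, 80, 5, 5]
assignment = balance_by_atom_count(atom_counts, num_workers=2)
loads = worker_loads(atom_counts, assignment)
```

Splitting the same eight samples evenly by count, in their original order, would give one worker 230 atoms and the other 180. The greedy assignment instead narrows the gap between the heaviest and lightest worker, which is the quantity that determines how long synchronized multi-node training waits at each step.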
