When More Data Hurts: Optimizing Data Coverage While Mitigating Diversity Induced Underfitting in an Ultra-Fast Machine-Learned Potential

Jason B. Gibson; Tesia D. Janicki; Ajinkya C. Hire; Chris Bishop; J. Matthew D. Lane; Richard G. Hennig

TL;DR

This study investigates how training data diversity affects the performance of MLIPs using the Ultra-Fast Force Field to model amorphous silicon nitride and finds that the UF$^3$ variant trained on a subset of the training data, in which nitrogen-rich structures were removed, offered vastly better prediction and simulation accuracy than any other variant.

Abstract

Machine-learned interatomic potentials (MLIPs) are becoming an essential tool in materials modeling. However, optimizing the generation of training data used to parameterize the MLIPs remains a significant challenge. This is because MLIPs can fail when encountering local environments too different from those present in the training data. The difficulty of determining \textit{a priori} the environments that will be encountered during molecular dynamics (MD) simulation necessitates diverse, high-quality training data. This study investigates how training data diversity affects the performance of MLIPs using the Ultra-Fast Force Field (UF$^3$) to model amorphous silicon nitride. We employ expert and autonomously generated data to create the training data and fit four force-field variants to subsets of the data. Our findings reveal a critical balance in training data diversity: insufficient diversity hinders generalization, while excessive diversity can exceed the MLIP's learning capacity, reducing simulation accuracy. Specifically, we found that the UF$^3$ variant trained on a subset of the training data, in which nitrogen-rich structures were removed, offered vastly better prediction and simulation accuracy than any other variant. By comparing these UF$^3$ variants, we highlight the nuanced requirements for creating accurate MLIPs, emphasizing the importance of application-specific training data to achieve optimal performance in modeling complex material behaviors.
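The subset selection described in the abstract (removing nitrogen-rich structures before fitting) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract gives no criterion for "nitrogen-rich", so the threshold here (a nitrogen atom fraction above the stoichiometric Si$_3$N$_4$ value of 4/7, plus a tolerance `tol`) is an assumption, and the helper names `nitrogen_fraction` and `drop_nitrogen_rich` are hypothetical.

```python
from collections import Counter

# N atom fraction in stoichiometric Si3N4 (4 N atoms out of 7 atoms).
STOICH_N_FRACTION = 4 / 7


def nitrogen_fraction(symbols):
    """Return the fraction of atoms in `symbols` that are nitrogen."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return counts["N"] / total if total else 0.0


def drop_nitrogen_rich(structures, tol=0.02):
    """Keep structures whose N fraction is at most stoichiometric + tol.

    `tol` is an assumed tolerance; the source does not define the cutoff.
    """
    limit = STOICH_N_FRACTION + tol
    return [s for s in structures if nitrogen_fraction(s) <= limit]


# Toy structures given as lists of chemical symbols (illustrative only).
structures = [
    ["Si"] * 3 + ["N"] * 4,  # stoichiometric
    ["Si"] * 2 + ["N"] * 5,  # nitrogen-rich
    ["Si"] * 4 + ["N"] * 3,  # silicon-rich
]
kept = drop_nitrogen_rich(structures)
```

In this toy example only the second structure is filtered out. A real workflow would read the compositions from the training set's structure files rather than from hand-written symbol lists.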


Paper Structure (3 equations, 4 figures)


Figures (4)

Figure 1: Speed comparison of DFT, GAP, empirical potentials, and UF$^3$. The plot shows the speed of DFT, the GAP \cite{sin_mlip_1} and UF$^3$ MLIPs, and the MG2 \cite{mg2}, Vashishta \cite{vashishta}, and Tersoff \cite{tersoff} empirical potentials. The speed benchmarks were run on 4 CPUs for 100 timesteps with a simulation size of 3024 atoms.
Figure 2: Visualization of the 2-body terms for the UF$^3$ variants. The lines represent the learned two-body interaction for each variant for (a) Si-Si, (b) Si-N, and (c) N-N interactions.
Figure 3: Validation force predictions for the UF$^3$ variants. The green markers show the predictions for the stoichiometric test set compared to DFT. The red markers show the predictions for the non-stoichiometric test set compared to DFT. The color gradient shows the density of points. The tables show the regression metrics for the variants on both test sets.
Figure 4: Validation simulation results for the UF$^3$ variants. (a) shows the elastic properties simulated by the UF$^3$ variants compared to DFT. The solid black circle indicates zero error. The inner and outer circles represent 50% under- and overprediction, respectively. The subscript represents the Si$_3$N$_4$ polytype. The number in each spider plot is the mean absolute percent error (MAPE) over all computed quantities relative to DFT. (b, c) show the structural properties of amorphous Si$_3$N$_4$ at 1700 K for the UF$^3$ variants. The initial amorphous structure started from a random configuration of 224 atoms. It was then equilibrated using UF$^3_{\text{final}}$ for 16 ns at 1700 K. This structure was then further equilibrated for an additional 800 ps using either aiMD for the DFT-equilibrated reference or the respective UF$^3$ variant. (b) shows the partial RDF for N-N (bottom), Si-Si (middle), and Si-N (top). The number in the legend is the final density of the amorphous structure. (c) shows the ADF computed by each variant for N-Si-N (top) and Si-N-Si (bottom) interactions.
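The MAPE quoted in the Figure 4 spider plots can be illustrated with a short sketch. The captions do not spell out the formula, so the standard definition is assumed here: the mean of $|x_{\text{pred}} - x_{\text{ref}}| / |x_{\text{ref}}|$ over all quantities, expressed in percent. The function name `mape` and the numbers below are hypothetical placeholders, not values from the paper.

```python
def mape(predicted, reference):
    """Mean absolute percent error of `predicted` relative to `reference`.

    Assumes the standard definition; reference values must be non-zero.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one value is required")
    total = sum(abs(p - r) / abs(r) for p, r in zip(predicted, reference))
    return 100.0 * total / len(reference)


# Placeholder values standing in for elastic properties (not from the paper).
reference_values = [100.0, 200.0, 50.0]   # e.g. DFT results
predicted_values = [110.0, 180.0, 50.0]   # e.g. MLIP results
error_percent = mape(predicted_values, reference_values)
```

Here the individual errors are 10%, 10%, and 0%, so `error_percent` is about 6.67. Because each term is normalized by its own reference value, quantities of very different magnitude contribute equally to the average.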