Scalable Data-Driven Basis Selection for Linear Machine Learning Interatomic Potentials

Tina Torabi, Matthias Militzer, Michael P. Friedlander, Christoph Ortner

TL;DR

This work tackles the challenge of feature selection in linear machine-learning interatomic potentials by integrating active-set sparse recovery into the Atomic Cluster Expansion (ACE) framework. By applying active-set pursuit (ASP) and orthogonal matching pursuit (OMP), the authors automatically identify a compact, informative subset of basis functions, producing model paths that navigate the trade-off between cost and accuracy without extensive hyperparameter tuning. Across limited-diversity metals, elemental silicon, and liquid water benchmarks, sparse solvers consistently match or exceed the performance of dense baselines while using far fewer basis functions, demonstrating improved generalization and interpretability. The approach enables scalable, robust interatomic potentials suitable for long-time molecular dynamics simulations, with reduced manual intervention and clearer insight into the physically relevant interactions.

Abstract

Machine learning interatomic potentials (MLIPs) provide an effective approach for accurately and efficiently modeling atomic interactions, expanding the capabilities of atomistic simulations to complex systems. However, a priori feature selection leads to high complexity, which can be detrimental to both computational cost and generalization, resulting in a need for hyperparameter tuning. We demonstrate the benefits of active set algorithms for automated data-driven feature selection. The proposed methods are implemented within the Atomic Cluster Expansion (ACE) framework. Computational tests conducted on a variety of benchmark datasets indicate that sparse ACE models consistently enhance computational efficiency, generalization accuracy and interpretability over dense ACE models. An added benefit of the proposed algorithms is that they produce entire paths of models with varying cost/accuracy ratio.
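
To make the idea of a greedy, data-driven model path concrete, the following is a minimal sketch of generic orthogonal matching pursuit on a linear least-squares problem. It is not the authors' implementation and does not use ACE: the design matrix `A` is a synthetic stand-in for a matrix of basis-function evaluations, and the names `omp_path`, `A`, `y`, and `c_true` are hypothetical, chosen only for illustration.

```python
import numpy as np


def omp_path(A, y, max_terms):
    """Generic orthogonal matching pursuit on min ||A c - y||_2.

    Returns one (support, coefficients, residual_norm) tuple per step, so the
    caller gets a whole path of models of increasing size.
    """
    n_features = A.shape[1]
    col_norms = np.linalg.norm(A, axis=0)
    col_norms[col_norms == 0.0] = 1.0  # avoid division by zero for empty columns
    selected = np.zeros(n_features, dtype=bool)
    support = []
    residual = y.astype(float)
    path = []
    for _ in range(min(max_terms, n_features)):
        # Score each unselected column by its normalized correlation with the residual.
        scores = np.abs(A.T @ residual) / col_norms
        scores[selected] = -np.inf
        j = int(np.argmax(scores))
        selected[j] = True
        support.append(j)
        # Refit all active coefficients by least squares (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
        path.append((list(support), coef, float(np.linalg.norm(residual))))
    return path


# Synthetic stand-in (assumption): 200 observations, 50 candidate basis functions,
# of which only three carry signal.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
c_true = np.zeros(50)
c_true[[3, 17, 41]] = [2.0, -3.0, 1.5]
y = A @ c_true

for support, coef, res in omp_path(A, y, max_terms=5):
    print(len(support), sorted(support), f"residual={res:.2e}")
```

Each entry of the returned path is a complete model of a given size, so a model can be chosen afterwards by scoring the entries on held-out data rather than by re-solving with different hyperparameters.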

Paper Structure

This paper contains 18 sections, 48 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Energy MAE vs. basis size for selected limited-diversity datasets (cf. \ref{sec:limit_metal}), comparing three sparse least squares solvers (ARD, ASP, OMP) with a direct regularized least squares approach (RRQR).
  • Figure 2: Visualization of the basis functions selected for two-body interactions in the Mo dataset (cf. \ref{sec:limit_metal}). The figure illustrates the gradual selection process of ASP and non-sparse ACE solvers as the number of active basis functions increases. The ASP-selected basis functions (pink) show a distinct, data-driven selection pattern compared to the full ACE basis (black), demonstrating the benefits of data-driven basis selection.
  • Figure 3: Visualization of the basis functions selected for three-body interactions in the Mo dataset (cf. \ref{sec:limit_metal}). The figure illustrates the gradual selection process of ASP and non-sparse ACE solvers as the number of active basis functions increases. The selection is visualized in 3D, with colors indicating the distance from the origin. As in the 2-correlation case, ASP selects features without any a priori predictable pattern.
  • Figure 4: Energy MAE vs. basis size for the Silicon dataset PhysRevX.8.041048, comparing OMP, BLR, and ASP (cf. \ref{sec:silicon18}). In the mid-to-high basis size regime, ASP and OMP achieve similar or better accuracy than BLR while using significantly fewer basis functions.
  • Figure 5: Percentage error relative to the computed values in Table \ref{tab:Sicomparison} for various silicon properties using GAP, BLR, and ASP (cf. \ref{sec:silicon18}). The plot illustrates how the relative error of ASP decreases as the number of active basis functions increases, demonstrating a clear trend of improved accuracy.
  • ...and 2 more figures