From Evaluation to Design: Using Potential Energy Surface Smoothness Metrics to Guide Machine Learning Interatomic Potential Architectures

Ryan Liu, Eric Qu, Tobias Kreiman, Samuel M. Blau, Aditi S. Krishnapriyan

TL;DR

The paper addresses the mismatch between MLIP regression accuracy and the true smoothness of quantum PES, which can destabilize MD simulations. It introduces BSCT, a computationally efficient benchmark that probes PES smoothness along controlled bond deformations and defines the Force Smoothness Deviation (FSD) as a fast proxy for MD stability. Through a neutral Transformer-like testbed (MinDScAIP), the authors demonstrate that targeted architectural refinements, including Diff-kNN, controllable Gaussian smearing, and temperature-controlled attention, reduce nonphysical PES features and improve both near- and far-from-equilibrium performance. BSCT is shown to be a practical in-the-loop design proxy that helps MLIP developers identify and mitigate physical challenges not captured by conventional benchmarks, with broader implications for reliable atomistic simulations. The work also provides evidence that combining physics-based evaluation with careful architecture design yields MLIPs that balance accuracy, stability, and scalability.

Abstract

Machine Learning Interatomic Potentials (MLIPs) sometimes fail to reproduce the physical smoothness of the quantum potential energy surface (PES), leading to erroneous behavior in downstream simulations that standard energy and force regression evaluations can miss. Existing evaluations, such as microcanonical molecular dynamics (MD), are computationally expensive and primarily probe near-equilibrium states. To improve evaluation metrics for MLIPs, we introduce the Bond Smoothness Characterization Test (BSCT). This efficient benchmark probes the PES via controlled bond deformations and detects non-smoothness, including discontinuities, artificial minima, and spurious forces, both near and far from equilibrium. We show that BSCT correlates strongly with MD stability while requiring a fraction of the cost of MD. To demonstrate how BSCT can guide iterative model design, we utilize an unconstrained Transformer backbone as a testbed, illustrating how refinements such as a new differentiable $k$-nearest neighbors algorithm and temperature-controlled attention reduce artifacts identified by our metric. By optimizing model design systematically based on BSCT, the resulting MLIP simultaneously achieves a low conventional E/F regression error, stable MD simulations, and robust atomistic property predictions. Our results establish BSCT as both a validation metric and as an "in-the-loop" model design proxy that alerts MLIP developers to physical challenges that cannot be efficiently evaluated by current MLIP benchmarks.
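
To make the idea of a differentiable $k$-nearest neighbors selection concrete, the sketch below replaces the hard sort-and-truncate step with a sigmoid-based soft rank followed by a smooth cutoff at rank $k$. This is a generic soft-ranking construction and not necessarily the form used in the paper; the function names `soft_rank` and `soft_knn_weights` and the temperatures `tau_rank` and `tau_cut` are hypothetical and chosen only for illustration.

```python
import numpy as np

def _sigmoid(x):
    # tanh form of the logistic function; avoids overflow for large |x|
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def soft_rank(dist, tau_rank=0.05):
    """Smooth 0-based rank of each distance (assumed generic form).

    rank_i = sum_{j != i} sigmoid((d_i - d_j) / tau_rank)
    As tau_rank -> 0 this approaches the hard rank from sorting.
    """
    diff = (dist[:, None] - dist[None, :]) / tau_rank
    # every row includes the self term sigmoid(0) = 0.5, so subtract it
    return _sigmoid(diff).sum(axis=1) - 0.5

def soft_knn_weights(dist, k, tau_rank=0.05, tau_cut=0.25):
    """Smooth membership weight in (0, 1) for the k nearest neighbors.

    Close to 1 for soft ranks below k and close to 0 above, with a
    continuous transition instead of a hard cut.
    """
    ranks = soft_rank(dist, tau_rank)
    return _sigmoid((k - 0.5 - ranks) / tau_cut)

# Hypothetical neighbor distances (arbitrary units) around one atom
distances = np.array([1.0, 2.0, 3.0, 4.0])
print(soft_rank(distances))
print(soft_knn_weights(distances, k=2))
```

Because every operation is smooth in the distances, a neighbor that crosses the rank-$k$ boundary fades in or out gradually rather than switching abruptly, which is the property that lets forces be taken as gradients of the energy without a jump at the crossing.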

Paper Structure

This paper contains 45 sections, 11 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Example of how the Bond Smoothness Characterization Test (BSCT) is constructed. We show a C-C bond in a $C_2H_2F_4$ molecule from the BSCT-SPICE dataset. The DFT reference PES varies smoothly across the wide range of bond lengths, showing the behavior expected from reliable interatomic potentials. These 1D probes are a simple and efficient way to measure PES smoothness in a regime of highly varying energies, which is inherently out-of-distribution for many MLIP training datasets.
  • Figure 2: (a) Motivation for our proposed potential energy surface (PES) smoothness metric. We compare two hypothetical PESs. Both PES1 and PES2 accurately reproduce the true quadratic PES near equilibrium. Away from equilibrium, PES1 slowly deviates from the quadratic reference PES but remains smooth, whereas PES2 has an artificial minimum (non-smoothness) enclosed by gold dashed lines. (b) Standard metrics, such as energy and force mean absolute errors (MAEs), evaluated on the one-dimensional subset fail to detect PES2's non-smoothness. (c) Our proposed force smoothness deviation (FSD) metric sensitively captures this non-smooth behavior.
  • Figure 3: To study how different model design choices affect PES smoothness, we design a neural network backbone similar to Swin Transformer as a neutral testbed for BSCT-guided architecture ablations. We generalize the shifted window attention of Swin Transformer to graphs by alternating in-and-out neighborhood attention. (a) The structural similarity to a generic Transformer is intentional and keeps the testbed neutral. (b) The interleaving windows allow information to propagate across the molecular graph.
  • Figure 4: The Diff-kNN algorithm inherits the computational advantage of the $k$-NN algorithm while maintaining differentiability by replacing the hard ranking used in standard $k$-NN with the soft ranking defined in the paper's soft-ranking equation. This allows the architecture to be fully differentiable, so MLIP force predictions can be computed as the negative gradient of the potential energy.
  • Figure 5: Example of how BSCT can serve as an in-the-loop evaluation for MLIP development. We probe a $C_{11}H_{12}NO_2$ molecule in the BSCT-SPICE dataset, visualizing $\log\left({\Vert \Delta \vec{F}_{\text{MLIP}} \Vert^2} / {\Vert \Delta \vec{F}_{\text{DFT}} \Vert^2}\right)$, whose derivative with respect to $\alpha$ defines FSD. We also visualize the changes in the attention scores from the stretched N-C bond to the N atom along the bond scan, with heads overlaid. The strong correlation between FSD and the attention scores suggests the need for explicit regularization, motivating the proposed temperature-controlled attention.
  • ...and 3 more figures
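
The motivation in Figure 2 and the log-ratio quantity in Figure 5 can be illustrated with a toy one-dimensional scan. The sketch below is not the paper's implementation: the captions do not say how $\Delta \vec{F}$ is formed or how the derivative is reduced to a single number, so it assumes that $\Delta F$ is the force change between adjacent scan points and that the score is the mean absolute derivative of the log-ratio with respect to the scan coordinate $\alpha$. The function `fsd_like` and both model surfaces are hypothetical.

```python
import numpy as np

def fsd_like(alpha, f_model, f_ref, eps=1e-12):
    """FSD-style score on a 1D scan (assumed form, see lead-in).

    Takes the log-ratio of squared force increments and averages the
    magnitude of its derivative along the scan coordinate alpha.
    """
    d_model = np.diff(f_model)           # assumed: increment between scan points
    d_ref = np.diff(f_ref)
    log_ratio = np.log((d_model**2 + eps) / (d_ref**2 + eps))
    mid = 0.5 * (alpha[1:] + alpha[:-1])  # coordinates of the increments
    return np.mean(np.abs(np.gradient(log_ratio, mid)))

alpha = np.linspace(-1.0, 2.0, 601)       # toy scan coordinate

# Reference: quadratic PES E = 0.5 * alpha^2, so F = -alpha
f_ref = -alpha

# PES1: smooth drift away from the reference, E = 0.5 a^2 - 0.05 a^3
f_pes1 = -alpha + 0.15 * alpha**2

# PES2: reference plus a narrow Gaussian well that creates an
# artificial minimum near alpha = 1.5
A, c, s = 0.15, 1.5, 0.05
u = alpha - c
f_pes2 = -alpha - A * u / s**2 * np.exp(-u**2 / (2 * s**2))

mae_1 = np.mean(np.abs(f_pes1 - f_ref))
mae_2 = np.mean(np.abs(f_pes2 - f_ref))
fsd_1 = fsd_like(alpha, f_pes1, f_ref)
fsd_2 = fsd_like(alpha, f_pes2, f_ref)

print(f"force MAE: PES1={mae_1:.3f}  PES2={mae_2:.3f}")
print(f"FSD-like : PES1={fsd_1:.3f}  PES2={fsd_2:.3f}")
```

In this toy setup the surface with the artificial well has the lower force MAE of the two, yet its FSD-style score is much larger, because the score responds to rapid changes in how the force varies along the scan rather than to the average size of the error.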